Posted to dev@lucene.apache.org by ka...@nokia.com on 2010/04/20 14:25:06 UTC

FW: Solr and LCF security at query time

FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily, LCF establishes a powerful multi-repository security model, even though it doesn't yet take the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and to use the appropriate authority to secure any given document.  The Solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found these:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list containing items that come from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)
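
Step (3) above could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not MetaCarta's or LCF's actual code: the field names allow_token and deny_token are hypothetical, the tokens are assumed to have already been pulled out of the HTTP headers, and the visibility rule (at least one matching allow token and no matching deny token) is an assumption as well.

```python
def quote_term(token: str) -> str:
    """Quote a token as a Lucene phrase, escaping backslashes and double quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_token_filter(tokens, allow_field="allow_token", deny_field="deny_token"):
    """Build a restricting clause from the user's access tokens.

    Field names are hypothetical. The clause keeps a document only if it
    carries one of the user's tokens in the allow field and none of them
    in the deny field (an assumed rule).
    """
    if not tokens:
        # With no tokens there is nothing the user may see; the caller
        # should return an empty result instead of running the query.
        raise ValueError("no access tokens: the caller should return no results")
    terms = " OR ".join(quote_term(t) for t in tokens)
    return f"{allow_field}:({terms}) AND NOT {deny_field}:({terms})"


if __name__ == "__main__":
    print(build_token_filter(["xxxxxxx", "yyyyyyy"]))
```

The resulting clause would be attached to whatever query the client webapp already issues, so that restricted documents never enter the result list in the first place.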

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is authenticate your user against Active Directory, or whatever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I only see these sentences:

"Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

The LCF Solr output connector uses the Solr HTTP API to post documents and metadata into the Solr pipeline and index.  What I meant by that was that I will modify the Solr output connector to also transmit the access tokens to the Solr HTTP API as additional document metadata.
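
As a rough illustration of what "access tokens as additional document metadata" could look like on the wire, here is a minimal Python sketch. It assumes the connector passes metadata as literal.<field> request parameters (the thread mentions literal.id being set to the document's URI); the field names allow_token and deny_token are hypothetical and are not taken from the LCF source.

```python
from urllib.parse import urlencode


def extract_params(doc_uri, allow_tokens, deny_tokens,
                   allow_field="allow_token", deny_field="deny_token"):
    """Build a query string carrying the document id plus its access tokens.

    Each token is sent as a repeated literal.<field> parameter, so the
    hypothetical allow/deny fields would need to be multi-valued in the schema.
    """
    params = [("literal.id", doc_uri)]
    params += [(f"literal.{allow_field}", token) for token in allow_tokens]
    params += [(f"literal.{deny_field}", token) for token in deny_tokens]
    return urlencode(params)


if __name__ == "__main__":
    print(extract_params("http://example.com/doc1", ["xxxxxxx"], ["yyyyyyy"]))
```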

Thanks,
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 6:51 PM
To: connectors-user@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - it's just a matter of implementing them in the SOLR-1872 plugin.
I'll need to digest the LCF stuff to get to grips with it. Please bear with me while I do that...

When you say:
   The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The url would be something like this (on a locally installed tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
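
The fetch described above could be sketched like this in Python. The URL and the TOKEN: line prefix are taken from this message; everything else is an assumption, including the UTF-8 decoding and the choice to silently skip any line that does not start with TOKEN:, since the exact response format is only described as "something like" the lines shown.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as given in the message, for a locally installed Tomcat-based instance.
DEFAULT_BASE = "http://localhost:8080/lcf-authority-service/UserACLs"


def user_acls_url(username, base=DEFAULT_BASE):
    """Build the authority-service URL for one user name."""
    return f"{base}?{urlencode({'username': username})}"


def parse_user_acls(body):
    """Collect the values of all TOKEN: lines; other lines are ignored (assumption)."""
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


def fetch_user_acls(username, base=DEFAULT_BASE):
    """Fetch and parse the tokens for a user; needs a running authority service."""
    with urlopen(user_acls_url(username, base)) as response:
        return parse_user_acls(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(parse_user_acls("TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"))
```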

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org

Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?
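
To make the "pluggable" idea concrete, here is one possible shape for such an interface, sketched in Python. None of these names exist in SOLR-1872 or LCF; AccessTokenProvider, StaticProvider and ChainedProvider are hypothetical, with StaticProvider standing in for something like acl.xml and a further provider (not shown) wrapping an external token source.

```python
from abc import ABC, abstractmethod


class AccessTokenProvider(ABC):
    """Hypothetical plug-in point: anything that can map a user to access tokens."""

    @abstractmethod
    def tokens_for(self, username: str) -> list:
        ...


class StaticProvider(AccessTokenProvider):
    """Stand-in for a file-based source: a fixed user -> tokens table."""

    def __init__(self, table):
        self._table = dict(table)

    def tokens_for(self, username):
        return list(self._table.get(username, []))


class ChainedProvider(AccessTokenProvider):
    """Ask several providers in order and merge their tokens without duplicates."""

    def __init__(self, providers):
        self._providers = list(providers)

    def tokens_for(self, username):
        merged = []
        for provider in self._providers:
            for token in provider.tokens_for(username):
                if token not in merged:
                    merged.append(token)
        return merged


if __name__ == "__main__":
    local = StaticProvider({"peter": ["shop-1"]})
    other = StaticProvider({"peter": ["shop-1", "shop-2"]})
    print(ChainedProvider([local, other]).tokens_for("peter"))
```

Keeping the interface down to a single method is a design choice made for the sketch: the search component would then not need to know whether tokens come from a file, a database, or a remote service.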

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter







RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

You should be able to use LCF authorities for your purposes.  I'm less clear about what you mean by the "interface into decoupled acl storage".  Existing repository connectors are not aware of any decoupled storage, and if you were to adopt the LCF model in its entirety, it sounds to me as though you would have defeated your purpose.

Of course, if the model is an acl store which simply refers to access tokens that might come back from the authorities, then there would be no need to modify connectors or LCF at all - just modify your plug-in to talk to the appropriate LCF authority service, and be done with it.  In that case, though, your acl file would have to contain access tokens exactly the way LCF authorities construct them, or it won't work.

Karl



________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 11:04 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

Yes, your point about filters vs queries is a good one. I do need to move the fq building to the Lucene model.

It's true that in the case of, say, NTFS, there is already access control built into the source files.
The differences are, as you pointed out, the ones that don't have this, and also Solr indexes that hold multiple types of data (a bit of NTFS, a bit of web, some rss etc.). It's probably true to say that most Solr indexes today contain at least some data that has no intrinsic security built into its source.

I can see from an LCF perspective that the proposed model fits in with it, but is LCF really an 'all or nothing' framework with regard to repositories? I guess, not coming from the LCF side, but from the 'generic data' side of things, I thought LCF would work for 1) the authority side of things (i.e. an interface into AD et al.) and 2) possibly an interface into decoupled acl storage - i.e. the decoupled data, whether stored in a file, another Solr core, a sql db or whatever, would become the 'repository' - with perhaps the difference that it holds user->search acl rather than user->file acl.

Would LCF work in this way, or would it simply be too much work to make it practical?


Thanks,
Peter


On Thu, Apr 29, 2010 at 3:45 PM, <ka...@nokia.com> wrote:
If we aren't talking about a repository of some kind, then we aren't talking about using LCF.  If your design point is about applying security to NFS via an acl-xml file, your uploaded contribution will do that just fine (although I think you might want to use Filters in some places you are currently using Querys, according to what I've learned over the past day or two).

If a repository with security is involved, there's no benefit I can see to building yet another security mechanism above and beyond the one that the repository would provide.  It's double the administration, and in that light only makes sense at all if there's no native security mechanism present in whatever your data source is.  There are certainly a number of "repositories" with this characteristic, though - the web, rss feeds, file systems, etc.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 9:56 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

- There's a significant extra load on the repository, because every search result has to be checked against the repository in real time

By repository, do you mean, for example, NTFS? You certainly wouldn't want, or need to do that at all, particularly for environments where the repository isn't available. That's kind of the point of having the acl decoupled.

- It will perform very poorly on queries where there are a lot of matching documents, but the search user can't see most of them

The performance of the filter queries would be no worse (or better) than any other of similar length/complexity. Essentially, the filter queries between the two models are just using a different set of attributes (acl-specific vs. intrinsic to the document). If someone felt they needed to build lots of super-long complex filter queries to define a set of allowed/denied documents, their general search performance is probably not going to be great anyway, and would be remedied by organizing the data more efficiently (which is a good idea in any case).


Thanks,
Peter


On Thu, Apr 29, 2010 at 1:10 PM, <ka...@nokia.com> wrote:
Putting access control lookup at search-result time has the following benefits:

- It sees changes right away, when the underlying repository changes

Here are the drawbacks, as far as I can see:

- There's a significant extra load on the repository, because every search result has to be checked against the repository in real time
- It will perform very poorly on queries where there are a lot of matching documents, but the search user can't see most of them

Having only one general solution means that you have to pick one or the other of the two models.  We opted for the model we did because the drawbacks were potentially severe, especially under conditions of high demand.  The repository load question is not a trivial one, because it scales as the number of results returned, which is a potentially gigantic number.

However, I am perfectly fine with supporting both models.  Your suggested solution will work for some classes of problem.  It seems to me that in order to support it you will need a parallel infrastructure to do that.  We could develop that infrastructure within LCF, but it's a bit of work to do:

(1) Output an "internal repository document security identifier" into the index, in addition to tokens.  This id is not the same at all as the document's URI, which is what literal.id is currently set to, so a new Solr schema field would need to be made for this.  All output connectors would need to be modified to do this, and all repository connectors as well.
(2) Since the security identifier would be valid within the context of a given repository connection, the "authority service" code that tries to verify visibility of a document given the authenticated user name and security identifier would need to look up the correct repository connection and call a method within it - which currently doesn't exist.  So we'd need to write such a method for all connectors that have security.
(3) Since this service would have a high load, and only be used under one particular model, I'd suggest actually defining a whole new webapp for it, so it can be distributed/controlled independently.
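
The per-result model outlined in (1) to (3) might look roughly like the following Python sketch. Every name here is hypothetical: VisibilityService stands in for the proposed new webapp, each "connection" is just a callable standing in for the per-connector method that, as noted above, does not currently exist, and the security_id field on each hit is the new schema field from step (1). The counter is there only to show that the number of repository checks grows with the number of hits, which is the load concern raised earlier.

```python
class VisibilityService:
    """Hypothetical stand-in for a service that checks one document at a time."""

    def __init__(self, connections):
        # connections: repository connection name -> callable(username, security_id) -> bool
        self._connections = connections
        self.checks = 0  # how many repository lookups have been made

    def is_visible(self, connection_name, security_id, username):
        self.checks += 1
        check = self._connections[connection_name]
        return check(username, security_id)


def post_filter(hits, username, service):
    """Keep only the hits the user may see; costs one repository check per hit."""
    return [
        hit for hit in hits
        if service.is_visible(hit["connection"], hit["security_id"], username)
    ]


if __name__ == "__main__":
    allowed = {("karl", "doc-2")}
    service = VisibilityService({"repo-a": lambda user, sid: (user, sid) in allowed})
    hits = [{"connection": "repo-a", "security_id": f"doc-{i}"} for i in range(1, 4)]
    visible = post_filter(hits, "karl", service)
    print(len(visible), "visible after", service.checks, "repository checks")
```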

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 5:35 AM
To: connectors-user@incubator.apache.org
Cc: dev@lucene.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org

Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I guess it comes down to - any solution is ultimately going to place access control on a search and not on data, so there isn't much to be gained by binding the access control to the data. Whatever attributes exist at index time to build an acl will still be there at query time, so by making the acl search-bound, the acl is decoupled from the data, allowing it to be used in any use case scenario.

Here's a typical sampling of use cases where the decoupling of acl from data is required:

One customer has a 'shop-search' requirement where logged-in users' access to various shops changes daily, sometimes 4 or 5 times a day. There are several hundred such shops and tens of millions of documents, and the indexing part doesn't have ownership of any of the 'source' documents.

Another example is a customer who has multiple sites and multiple AD domains. They have one domain for the UK, but a completely separate domain for Gibraltar. When data is replicated to remote servers accessed by Gibraltar staff, these users have no SID information in the other domain.

An 'interesting' example of this at the extreme is 34rkl4ys Bank, where, due to departmental history, they have no fewer than 85 AD domains! This of course is a nightmare in itself, but trying to tie access information to data at storage time is virtually impossible in this environment.

The thing I'm trying to understand is that the decoupled approach works equally well for the requirements where you do have acl information at index time. I guess I'm not understanding the advantages to making schema changes and binding acl to data, when there's really no need. I particularly like your idea of using LCF as the facilitator of storing/retrieving such decoupled data (as opposed to just an xml file). It sounds like there's even a user interface for 'non-technical' staff to make acl configuration changes. That's really cool, and ultimately an elegant solution that will fit present and future needs.


Kind regards,
Peter


On Thu, Apr 29, 2010 at 1:24 AM, <ka...@nokia.com> wrote:
Hi Peter,

I'm more than happy to hear your customer's requirements, so no problem there.  It does seem to me that they are a bit different than what I've seen.  I think there is plenty of room for different flavors of solution, so please by all means go ahead and propose your take on it!

Karl

________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 8:07 PM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I wasn't trying to put paid to your design proposal, really the opposite - to highlight requirements that have been found to be necessary for customers/users, and hopefully to get the best functionality for the product. If you feel I've put you out on any of the issues raised, then I apologize for that; it was certainly not my intention.

Peter





Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Yes, your point about filters vs queries is a good one. I do need to move
the fq building to the Lucene model.

It's true that in the case of, say, NTFS, there is already access control
built into the source files.
The differences are, as you pointed out, the ones that don't have this, and
also Solr indexes that hold multiple types of data (a bit of NTFS, a bit of
web, some rss etc.). It's probably true to say that most Solr indexes today
contain at least some data that has no intrinsic security built into its
source.

I can see from an LCF perspective that the proposed model fits in with it,
but is LCF really an 'all or nothing' framework with regard to repositories? I
guess, not coming from the LCF side, but from the 'generic data' side of
things, I thought LCF would work for 1) the authority side of things (i.e.
an interface into AD et al.) and 2) possibly an interface into decoupled acl
storage - i.e. the decoupled data, whether stored in a file, another Solr
core, a sql db or whatever, would become the 'repository' - with perhaps the
difference that it holds user->search acl rather than user->file acl.

Would LCF work in this way, or would it simply be too much work to make it
practical?


Thanks,
Peter


On Thu, Apr 29, 2010 at 3:45 PM, <ka...@nokia.com> wrote:

>  If we aren't talking about a repository of some kind, then we aren't
> talking about using LCF.  If your design point is about applying security to
> NFS via an acl-xml file, your uploaded contribution will do that just fine
> (although I think you might want to use Filters in some places you are
> currently using Querys, according to what I've learned over the past day or
> two).
>
> If a repository with security is involved, there's no benefit I can see to
> building yet another security mechanism above and beyond the one that the
> repository would provide.  It's double the administration, and in that light
> only makes sense at all if there's no native security mechanism present in
> whatever your data source is.  There are certainly a number of
> "repositories" with this characteristic, though - the web, rss feeds, file
> systems, etc.
>
> Karl
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Thursday, April 29, 2010 9:56 AM
>
> *To:* dev@lucene.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
>  Hi Karl,
>
> - There's a significant extra load on the repository, because every search
> result has to be checked against the repository in real time
>
> By repository, do you mean, for example, NTFS? You certainly wouldn't want,
> or need to do that at all, particularly for environments where the
> repository isn't available. That's kind of the point of having the acl
> decoupled.
>
> - It will perform very poorly on queries were there are a lot of matching
> documents, but the search user can't see most of them
>
> The performance of the filter queries would be no worse (or better) than
> any other of similar length/complexity. Essentially, the filter queries
> between the two models are just using a different set of attributes
> (acl-specific vs. intrinsic to the document). If someone felt they needed to
> build lots of super-long complex filter queries to define a set of
> allowed/denied documents, their general search performance is probably not
> going to be great anyway, and would be remedied by organizing the data more
> efficiently (which is a good idea in any case).
>
>
> Thanks,
> Peter
>
>
> On Thu, Apr 29, 2010 at 1:10 PM, <ka...@nokia.com> wrote:
>
>>  Putting access control lookup at search-result time has the following
>> benefits:
>>
>> - It sees changes right away, when the underlying repository changes
>>
>> Here are the drawbacks, as far as I can see:
>>
>> - There's a significant extra load on the repository, because every search
>> result has to be checked against the repository in real time
>> - It will perform very poorly on queries were there are a lot of matching
>> documents, but the search user can't see most of them
>>
>> Having only one general solution means that you have to pick one or the
>> other of the two models.  We opted for the model we did because the
>> drawbacks were potentially severe, especially under conditions of high
>> demand.  The repository load question is not a trivial one, because it
>> scales as the number of results returned, which is a potentially gigantic
>> number.
>>
>> However, I am perfectly fine with supporting both models.  Your suggested
>> solution will work for some classes of problem.  It seems to me that in
>> order to support it you will need a parallel infrastructure to do that.  We
>> could develop that infrastructure within LCF, but it's a bit of work to do:
>>
>> (1) Output an "internal repository document security identifier" into the
>> index, in addition to tokens.  This id is not the same at all as the
>> document's URI, which is what literal.id is currently set to, so a new
>> solr schema field would need to be made for this.  All output connectors
>> would need to be modified to do this, and all repository connectors as well.
>> (2) Since the security identifier would be valid within the context of a
>> given repository connection, the "authority service" code that tries to
>> verify visibility of a document given the authenticated user name and
>> security identifier would need to look up the correct repository connection
>> and call a method within it - which currently doesn't exist.  So we'd need
>> to write such a method for all connectors that have security.
>> (3) Since this service would have a high load, and only be used under one
>> particular model, I'd suggest actually defining a whole new webapp for it,
>> so it can be distributed/controlled independently.
>>
>> Karl
>>
>>
>>  ------------------------------
>> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> *Sent:* Thursday, April 29, 2010 5:35 AM
>> *To:* connectors-user@incubator.apache.org
>> *Cc:* dev@lucene.apache.org; connectors-dev@incubator.apache.org;
>> lucene-dev@apache.org
>>
>> *Subject:* Re: FW: Solr and LCF security at query time
>>
>>  Hi Karl,
>>
>> I guess it comes down to - any solution is ultimately going to place
>> access control on a search and not on data, so there isn't much to be gained
>> by binding the access control to the data. Whatever attributes exist at
>> index time to build an acl will still be there at query time, so by making
>> the acl search-bound, the acl is decoupled from the data, allowing it to be
>> used in any use case scenario.
>>
>> Here's a typical sampling of use cases where the decoupling of acl from
>> data is required:
>>
>> One customer has a  'shop-search' requirement where, logged-in users'
>> access to various shops changes daily, sometimes 4 or 5 times a day. There
>> are several hundred such shops and 10s of millions of documents, and the
>> indexing part doesn't have ownership of any of the 'source' documents.
>>
>> Another example is a customer who has multiple sites and multiple AD
>> domains. They have one domain for the UK, but a completely separate domain
>> for Gibraltar. When data is replicated to  remote servers accessed by
>> Gibraltar staff, these users have no SID information in the other domain.
>>
>> An 'interesting' example of this at the extreme is 34rkl4ys Bank, where,
>> due to departmental history, they have no fewer than 85 AD domains! This of
>> course is a nightmare in itself, but trying to tie access information to
>> data at storage time is virtually impossible in this environment.
>>
>> The thing I'm trying to understand is that the decoupled approach works
>> equally well for the requirements where you do have acl information at index
>> time. I guess I'm not understanding the advantages to making schema changes
>> and binding acl to data, when there's really no need. I particularly like
>> your idea of using LCF as the facilitator of storing/retrieving such
>> decoupled data (as opposed to just an xml file). It sounds like there's even
>> a user interface for 'non-technical' staff to make acl configuration
>> changes. That's really cool, and ultimately an elegant solution that will
>> fit present and future needs.
>>
>>
>> Kind regards,
>> Peter
>>
>>
>> On Thu, Apr 29, 2010 at 1:24 AM, <ka...@nokia.com> wrote:
>>
>>> Hi Peter,
>>>
>>> I'm more than happy to hear your customer's requirements, so no problem
>>> there.  It does seem to me that they are a bit different than what I've
>>> seen.  I think there is plenty of room for different flavors of solution, so
>>> please by all means go ahead and propose your take on it!
>>>
>>> Karl
>>>
>>> ________________________________________
>>> From: ext Peter Sturge [peter.sturge@googlemail.com]
>>> Sent: Wednesday, April 28, 2010 8:07 PM
>>> To: dev@lucene.apache.org
>>> Cc: connectors-user@incubator.apache.org;
>>> connectors-dev@incubator.apache.org; lucene-dev@apache.org
>>> Subject: Re: FW: Solr and LCF security at query time
>>>
>>>  Hi Karl,
>>>
>>> I wasn't trying to put paid to your design proposal, really the
>>> opposite - to highlight requirements that have been found to be necessary
>>> for customers/users, and to hopefully get the best functionality for the
>>> product. If you feel I've put you out on any of the issues raised, then I
>>> apologize for that, it was certainly not my intention.
>>>
>>> Peter
>>>
>>>
>>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
If we aren't talking about a repository of some kind, then we aren't talking about using LCF.  If your design point is about applying security to NFS via an acl-xml file, your uploaded contribution will do that just fine (although I think you might want to use Filters in some places where you are currently using Queries, according to what I've learned over the past day or two).

If a repository with security is involved, there's no benefit I can see to building yet another security mechanism above and beyond the one that the repository would provide.  It's double the administration, and in that light only makes sense at all if there's no native security mechanism present in whatever your data source is.  There are certainly a number of "repositories" with this characteristic, though - the web, rss feeds, file systems, etc.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 9:56 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

- There's a significant extra load on the repository, because every search result has to be checked against the repository in real time

By repository, do you mean, for example, NTFS? You certainly wouldn't want, or need to do that at all, particularly for environments where the repository isn't available. That's kind of the point of having the acl decoupled.

- It will perform very poorly on queries where there are a lot of matching documents, but the search user can't see most of them

The performance of the filter queries would be no worse (or better) than any other of similar length/complexity. Essentially, the filter queries between the two models are just using a different set of attributes (acl-specific vs. intrinsic to the document). If someone felt they needed to build lots of super-long complex filter queries to define a set of allowed/denied documents, their general search performance is probably not going to be great anyway, and would be remedied by organizing the data more efficiently (which is a good idea in any case).


Thanks,
Peter


On Thu, Apr 29, 2010 at 1:10 PM, <ka...@nokia.com> wrote:
Putting access control lookup at search-result time has the following benefits:

- It sees changes right away, when the underlying repository changes

Here are the drawbacks, as far as I can see:

- There's a significant extra load on the repository, because every search result has to be checked against the repository in real time
- It will perform very poorly on queries where there are a lot of matching documents, but the search user can't see most of them

Having only one general solution means that you have to pick one or the other of the two models.  We opted for the model we did because the drawbacks were potentially severe, especially under conditions of high demand.  The repository load question is not a trivial one, because it scales as the number of results returned, which is a potentially gigantic number.

However, I am perfectly fine with supporting both models.  Your suggested solution will work for some classes of problem.  It seems to me that in order to support it you will need a parallel infrastructure to do that.  We could develop that infrastructure within LCF, but it's a bit of work to do:

(1) Output an "internal repository document security identifier" into the index, in addition to tokens.  This id is not the same at all as the document's URI, which is what literal.id is currently set to, so a new solr schema field would need to be made for this.  All output connectors would need to be modified to do this, and all repository connectors as well.
(2) Since the security identifier would be valid within the context of a given repository connection, the "authority service" code that tries to verify visibility of a document given the authenticated user name and security identifier would need to look up the correct repository connection and call a method within it - which currently doesn't exist.  So we'd need to write such a method for all connectors that have security.
(3) Since this service would have a high load, and only be used under one particular model, I'd suggest actually defining a whole new webapp for it, so it can be distributed/controlled independently.

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 5:35 AM
To: connectors-user@incubator.apache.org
Cc: dev@lucene.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I guess it comes down to - any solution is ultimately going to place access control on a search and not on data, so there isn't much to be gained by binding the access control to the data. Whatever attributes exist at index time to build an acl will still be there at query time, so by making the acl search-bound, the acl is decoupled from the data, allowing it to be used in any use case scenario.

Here's a typical sampling of use cases where the decoupling of acl from data is required:

One customer has a 'shop-search' requirement where logged-in users' access to various shops changes daily, sometimes 4 or 5 times a day. There are several hundred such shops and tens of millions of documents, and the indexing part doesn't have ownership of any of the 'source' documents.

Another example is a customer who has multiple sites and multiple AD domains. They have one domain for the UK, but a completely separate domain for Gibraltar. When data is replicated to  remote servers accessed by Gibraltar staff, these users have no SID information in the other domain.

An 'interesting' example of this at the extreme is 34rkl4ys Bank, where, due to departmental history, they have no fewer than 85 AD domains! This of course is a nightmare in itself, but trying to tie access information to data at storage time is virtually impossible in this environment.

The thing I'm trying to understand is that the decoupled approach works equally well for the requirements where you do have acl information at index time. I guess I'm not understanding the advantages to making schema changes and binding acl to data, when there's really no need. I particularly like your idea of using LCF as the facilitator of storing/retrieving such decoupled data (as opposed to just an xml file). It sounds like there's even a user interface for 'non-technical' staff to make acl configuration changes. That's really cool, and ultimately an elegant solution that will fit present and future needs.


Kind regards,
Peter


On Thu, Apr 29, 2010 at 1:24 AM, <ka...@nokia.com> wrote:
Hi Peter,

I'm more than happy to hear your customer's requirements, so no problem there.  It does seem to me that they are a bit different than what I've seen.  I think there is plenty of room for different flavors of solution, so please by all means go ahead and propose your take on it!

Karl

________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 8:07 PM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I wasn't trying to put paid to your design proposal, really the opposite - to highlight requirements that have been found to be necessary for customers/users, and to hopefully get the best functionality for the product. If you feel I've put you out on any of the issues raised, then I apologize for that, it was certainly not my intention.

Peter




Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

- There's a significant extra load on the repository, because every search
result has to be checked against the repository in real time

By repository, do you mean, for example, NTFS? You certainly wouldn't want,
or need to do that at all, particularly for environments where the
repository isn't available. That's kind of the point of having the acl
decoupled.

- It will perform very poorly on queries where there are a lot of matching
documents, but the search user can't see most of them

The performance of the filter queries would be no worse (or better) than any
other of similar length/complexity. Essentially, the filter queries between
the two models are just using a different set of attributes (acl-specific
vs. intrinsic to the document). If someone felt they needed to build lots of
super-long complex filter queries to define a set of allowed/denied
documents, their general search performance is probably not going to be
great anyway, and would be remedied by organizing the data more efficiently
(which is a good idea in any case).
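
To make the comparison concrete, here is a minimal sketch of how a filter clause might be built from whatever attribute set is used. The field name `shop_id`, the helper names, and the "match nothing" clause for users with no grants are illustrative assumptions, not part of the SOLR-1834 contribution or of LCF. The same builder works whether the values are acl-specific or intrinsic to the document, which is the point being made above.

```python
from urllib.parse import urlencode


def build_filter_query(field, allowed_values):
    """Return a Lucene-syntax filter clause such as shop_id:("a" OR "b").

    `field` and `allowed_values` are hypothetical; any indexed attribute
    (acl-specific or intrinsic to the document) could be used.
    """
    if not allowed_values:
        # No grants at all: build a clause intended to match no documents.
        return "(*:* -*:*)"
    quoted = []
    for value in sorted(allowed_values):
        # Escape backslashes and double quotes inside the phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"' + escaped + '"')
    return field + ":(" + " OR ".join(quoted) + ")"


def build_search_params(user_query, field, allowed_values):
    """Combine the user's query with the access filter as q/fq parameters."""
    return urlencode({"q": user_query,
                      "fq": build_filter_query(field, allowed_values)})
```

The cost of the clause grows with the number of allowed values, not with which model supplied them.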


Thanks,
Peter


On Thu, Apr 29, 2010 at 1:10 PM, <ka...@nokia.com> wrote:

>  Putting access control lookup at search-result time has the following
> benefits:
>
> - It sees changes right away, when the underlying repository changes
>
> Here are the drawbacks, as far as I can see:
>
> - There's a significant extra load on the repository, because every search
> result has to be checked against the repository in real time
> - It will perform very poorly on queries where there are a lot of matching
> documents, but the search user can't see most of them
>
> Having only one general solution means that you have to pick one or the
> other of the two models.  We opted for the model we did because the
> drawbacks were potentially severe, especially under conditions of high
> demand.  The repository load question is not a trivial one, because it
> scales as the number of results returned, which is a potentially gigantic
> number.
>
> However, I am perfectly fine with supporting both models.  Your suggested
> solution will work for some classes of problem.  It seems to me that in
> order to support it you will need a parallel infrastructure to do that.  We
> could develop that infrastructure within LCF, but it's a bit of work to do:
>
> (1) Output an "internal repository document security identifier" into the
> index, in addition to tokens.  This id is not the same at all as the
> document's URI, which is what literal.id is currently set to, so a new
> solr schema field would need to be made for this.  All output connectors
> would need to be modified to do this, and all repository connectors as well.
> (2) Since the security identifier would be valid within the context of a
> given repository connection, the "authority service" code that tries to
> verify visibility of a document given the authenticated user name and
> security identifier would need to look up the correct repository connection
> and call a method within it - which currently doesn't exist.  So we'd need
> to write such a method for all connectors that have security.
> (3) Since this service would have a high load, and only be used under one
> particular model, I'd suggest actually defining a whole new webapp for it,
> so it can be distributed/controlled independently.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Thursday, April 29, 2010 5:35 AM
> *To:* connectors-user@incubator.apache.org
> *Cc:* dev@lucene.apache.org; connectors-dev@incubator.apache.org;
> lucene-dev@apache.org
>
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> I guess it comes down to - any solution is ultimately going to place access
> control on a search and not on data, so there isn't much to be gained by
> binding the access control to the data. Whatever attributes exist at index
> time to build an acl will still be there at query time, so by making the acl
> search-bound, the acl is decoupled from the data, allowing it to be used in
> any use case scenario.
>
> Here's a typical sampling of use cases where the decoupling of acl from
> data is required:
>
> One customer has a 'shop-search' requirement where logged-in users'
> access to various shops changes daily, sometimes 4 or 5 times a day. There
> are several hundred such shops and tens of millions of documents, and the
> indexing part doesn't have ownership of any of the 'source' documents.
>
> Another example is a customer who has multiple sites and multiple AD
> domains. They have one domain for the UK, but a completely separate domain
> for Gibraltar. When data is replicated to  remote servers accessed by
> Gibraltar staff, these users have no SID information in the other domain.
>
> An 'interesting' example of this at the extreme is 34rkl4ys Bank, where,
> due to departmental history, they have no fewer than 85 AD domains! This of
> course is a nightmare in itself, but trying to tie access information to
> data at storage time is virtually impossible in this environment.
>
> The thing I'm trying to understand is that the decoupled approach works
> equally well for the requirements where you do have acl information at index
> time. I guess I'm not understanding the advantages to making schema changes
> and binding acl to data, when there's really no need. I particularly like
> your idea of using LCF as the facilitator of storing/retrieving such
> decoupled data (as opposed to just an xml file). It sounds like there's even
> a user interface for 'non-technical' staff to make acl configuration
> changes. That's really cool, and ultimately an elegant solution that will
> fit present and future needs.
>
>
> Kind regards,
> Peter
>
>
> On Thu, Apr 29, 2010 at 1:24 AM, <ka...@nokia.com> wrote:
>
>> Hi Peter,
>>
>> I'm more than happy to hear your customer's requirements, so no problem
>> there.  It does seem to me that they are a bit different than what I've
>> seen.  I think there is plenty of room for different flavors of solution, so
>> please by all means go ahead and propose your take on it!
>>
>> Karl
>>
>> ________________________________________
>> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> Sent: Wednesday, April 28, 2010 8:07 PM
>> To: dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org;
>> connectors-dev@incubator.apache.org; lucene-dev@apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>>  Hi Karl,
>>
>> I wasn't trying to put paid to your design proposal, really the opposite
>> - to highlight requirements that have been found to be necessary for
>> customers/users, and to hopefully get the best functionality for the
>> product. If you feel I've put you out on any of the issues raised, then I
>> apologize for that, it was certainly not my intention.
>>
>> Peter
>>
>>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Putting access control lookup at search-result time has the following benefits:

- It sees changes right away, when the underlying repository changes

Here are the drawbacks, as far as I can see:

- There's a significant extra load on the repository, because every search result has to be checked against the repository in real time
- It will perform very poorly on queries where there are a lot of matching documents, but the search user can't see most of them

Having only one general solution means that you have to pick one or the other of the two models.  We opted for the model we did because the drawbacks were potentially severe, especially under conditions of high demand.  The repository load question is not a trivial one, because it scales as the number of results returned, which is a potentially gigantic number.
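
As a rough illustration of why the load scales with the number of results, the sketch below counts repository calls under each model. Everything in it is assumed for the example: the document layout, the `allow_tokens` field, and the `repository_check` callback stand in for a real index and a real repository.

```python
def post_filter(matches, user, repository_check):
    """Check every matching document against the repository after the search.

    Returns the visible ids and the number of repository calls made.
    """
    calls = 0
    visible = []
    for doc in matches:
        calls += 1  # one real-time repository lookup per search result
        if repository_check(user, doc["id"]):
            visible.append(doc["id"])
    return visible, calls


def query_time_filter(matches, user_tokens):
    """Restrict results using access tokens stored in the index at crawl time.

    No repository calls are needed, but changes in the repository are only
    seen after the affected documents are re-indexed.
    """
    visible = [doc["id"] for doc in matches
               if user_tokens & set(doc["allow_tokens"])]
    return visible, 0


# 1000 matching documents, of which the user may see only 10.
docs = [{"id": i, "allow_tokens": ["staff"] if i % 100 == 0 else ["admins"]}
        for i in range(1000)]
post_visible, post_calls = post_filter(docs, "alice",
                                       lambda user, doc_id: doc_id % 100 == 0)
index_visible, index_calls = query_time_filter(docs, {"staff"})
```

In this toy case both models return the same ten documents, but the post-filter makes 1000 repository calls to find them.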

However, I am perfectly fine with supporting both models.  Your suggested solution will work for some classes of problem.  It seems to me that in order to support it you will need a parallel infrastructure to do that.  We could develop that infrastructure within LCF, but it's a bit of work to do:

(1) Output an "internal repository document security identifier" into the index, in addition to tokens.  This id is not the same at all as the document's URI, which is what literal.id is currently set to, so a new solr schema field would need to be made for this.  All output connectors would need to be modified to do this, and all repository connectors as well.
(2) Since the security identifier would be valid within the context of a given repository connection, the "authority service" code that tries to verify visibility of a document given the authenticated user name and security identifier would need to look up the correct repository connection and call a method within it - which currently doesn't exist.  So we'd need to write such a method for all connectors that have security.
(3) Since this service would have a high load, and only be used under one particular model, I'd suggest actually defining a whole new webapp for it, so it can be distributed/controlled independently.
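
The three steps above could be sketched roughly as follows. This is only an outline under stated assumptions: the class names, the `is_visible` method, and the in-memory acl table are hypothetical, and no such method exists in the LCF connectors today (as noted in step 2).

```python
class RepositoryConnection:
    """Hypothetical stand-in for a repository connection with security."""

    def __init__(self, name, acl):
        self.name = name
        self._acl = acl  # maps security identifier -> set of user names

    def is_visible(self, user, security_id):
        # Step (2): the per-connector method that would need to be written.
        return user in self._acl.get(security_id, set())


class AuthorityService:
    """Hypothetical service that resolves a connection and delegates to it."""

    def __init__(self, connections):
        self._connections = {conn.name: conn for conn in connections}

    def check(self, user, connection_name, security_id):
        # The security identifier from step (1) is only meaningful within
        # the repository connection that produced it.
        conn = self._connections.get(connection_name)
        if conn is None:
            return False  # unknown connection: deny by default
        return conn.is_visible(user, security_id)


service = AuthorityService([
    RepositoryConnection("sharepoint-uk", {"doc-17": {"alice"}}),
    RepositoryConnection("fileshare-gib", {"doc-17": {"bob"}}),
])
```

Note that the same identifier (`doc-17`) resolves differently per connection, which is why the index would need to carry both the identifier and the connection it belongs to.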

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 5:35 AM
To: connectors-user@incubator.apache.org
Cc: dev@lucene.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I guess it comes down to - any solution is ultimately going to place access control on a search and not on data, so there isn't much to be gained by binding the access control to the data. Whatever attributes exist at index time to build an acl will still be there at query time, so by making the acl search-bound, the acl is decoupled from the data, allowing it to be used in any use case scenario.

Here's a typical sampling of use cases where the decoupling of acl from data is required:

One customer has a 'shop-search' requirement where logged-in users' access to various shops changes daily, sometimes 4 or 5 times a day. There are several hundred such shops and tens of millions of documents, and the indexing part doesn't have ownership of any of the 'source' documents.

Another example is a customer who has multiple sites and multiple AD domains. They have one domain for the UK, but a completely separate domain for Gibraltar. When data is replicated to  remote servers accessed by Gibraltar staff, these users have no SID information in the other domain.

An 'interesting' example of this at the extreme is 34rkl4ys Bank, where, due to departmental history, they have no fewer than 85 AD domains! This of course is a nightmare in itself, but trying to tie access information to data at storage time is virtually impossible in this environment.

The thing I'm trying to understand is that the decoupled approach works equally well for the requirements where you do have acl information at index time. I guess I'm not understanding the advantages to making schema changes and binding acl to data, when there's really no need. I particularly like your idea of using LCF as the facilitator of storing/retrieving such decoupled data (as opposed to just an xml file). It sounds like there's even a user interface for 'non-technical' staff to make acl configuration changes. That's really cool, and ultimately an elegant solution that will fit present and future needs.


Kind regards,
Peter


On Thu, Apr 29, 2010 at 1:24 AM, <ka...@nokia.com> wrote:
Hi Peter,

I'm more than happy to hear your customer's requirements, so no problem there.  It does seem to me that they are a bit different than what I've seen.  I think there is plenty of room for different flavors of solution, so please by all means go ahead and propose your take on it!

Karl

________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 8:07 PM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I wasn't trying to put paid to your design proposal, really the opposite - to highlight requirements that have been found to be necessary for customers/users, and to hopefully get the best functionality for the product. If you feel I've put you out on any of the issues raised, then I apologize for that, it was certainly not my intention.

Peter



RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Putting access control lookup at search-result time has the following benefits:

- It sees changes right away, when the underlying repository changes

Here are the drawbacks, as far as I can see:

- There's a significant extra load on the repository, because every search result has to be checked against the repository in real time
- It will perform very poorly on queries were there are a lot of matching documents, but the search user can't see most of them

Having only one general solution means that you have to pick one or the other of the two models.  We opted for the model we did because the drawbacks were potentially severe, especially under conditions of high demand.  The repository load question is not a trivial one, because it scales as the number of results returned, which is a potentially gigantic number.

However, I am perfectly fine with supporting both models.  Your suggested solution will work for some classes of problem.  It seems to me that in order to support it you will need a parallel infrastructure to do that.  We could develop that infrastructure within LCF, but it's a bit of work to do:

(1) Output an "internal repository document security identifier" into the index, in addition to tokens.  This id is not the same at all as the document's URI, which is what literal.id is currently set to, so a new solr schema field would need to be made for this.  All output connectors would need to be modified to do this, and all repository connectors as well.
(2) Since the security identifier would be valid within the context of a given repository connection, the "authority service" code that tries to verify visibility of a document given the authenticated user name and security identifier would need to look up the correct repository connection and call a method within it - which currently doesn't exist.  So we'd need to write such a method for all connectors that have security.
(3) Since this service would have a high load, and only be used under one particular model, I'd suggest actually defining a whole new webapp for it, so it can be distributed/controlled independently.

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 5:35 AM
To: connectors-user@incubator.apache.org
Cc: dev@lucene.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I guess it comes down to - any solution is ultimately going to place access control on a search and not on data, so there isn't much to be gained by binding the access control to the data. Whatever attributes exist at index time to build an acl will still be there at query time, so by making the acl search-bound, the acl is decoupled from the data, allowing it to be used in any use case scenario.

Here's a typical sampling of use cases where the decoupling of acl from data is required:

One customer has a  'shop-search' requirement where, logged-in users' access to various shops changes daily, sometimes 4 or 5 times a day. There are several hundred such shops and 10s of millions of documents, and the indexing part doesn't have ownership of any of the 'source' documents.

Another example is a customer who has multiple sites and multiple AD domains. They have one domain for the UK, but a completely separate domain for Gibraltar. When data is replicated to  remote servers accessed by Gibraltar staff, these users have no SID information in the other domain.

An 'interesting' example of this at the extreme is 34rkl4ys Bank, where, due to departmental history, they have no fewer than 85 AD domains! This of course is a nightmare in itself, but trying to tie access information to data at storage time is virtually impossible in this environment.

The thing I'm trying to understand is that the decoupled approach works equally well for the requirements where you do have acl information at index time. I guess I'm not understanding the advantages to making schema changes and binding acl to data, when there's really no need. I particularly like your idea of using LCF as the facilitator of storing/retrieving such decoupled data (as opposed to just an xml file). It sounds like there's even a user interface for 'non-technical' staff to make acl configuration changes. That's really cool, and ultimately an elegant solution that will fit present and future needs.


Kind regards,
Peter


On Thu, Apr 29, 2010 at 1:24 AM, <ka...@nokia.com>> wrote:
Hi Peter,

I'm more than happy to hear your customer's requirements, so no problem there.  It does seem to me that they are a bit different than what I've seen.  I think there is plenty of room for different flavors of solution, so please by all means go ahead and propose your take on it!

Karl

________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com<ma...@googlemail.com>]
Sent: Wednesday, April 28, 2010 8:07 PM
To: dev@lucene.apache.org<ma...@lucene.apache.org>
Cc: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; connectors-dev@incubator.apache.org<ma...@incubator.apache.org>; lucene-dev@apache.org<ma...@apache.org>
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I wasn't trying to to put pay to your design proposal, really the opposite - to highlight requirements that have found to be necessary for customers/users, and to hopefully get the best functionality for the product. If you feel I've put you out on any of the issues raised, then I apologize for that, it was certainly not my intention.

Peter



RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Putting access control lookup at search-result time has the following benefits:

- It sees changes right away, when the underlying repository changes

Here are the drawbacks, as far as I can see:

- There's a significant extra load on the repository, because every search result has to be checked against the repository in real time
- It will perform very poorly on queries were there are a lot of matching documents, but the search user can't see most of them

Having only one general solution means that you have to pick one or the other of the two models.  We opted for the model we did because the drawbacks were potentially severe, especially under conditions of high demand.  The repository load question is not a trivial one, because it scales as the number of results returned, which is a potentially gigantic number.

However, I am perfectly fine with supporting both models.  Your suggested solution will work for some classes of problem.  It seems to me that in order to support it you will need a parallel infrastructure to do that.  We could develop that infrastructure within LCF, but it's a bit of work to do:

(1) Output an "internal repository document security identifier" into the index, in addition to tokens.  This id is not the same at all as the document's URI, which is what literal.id is currently set to, so a new solr schema field would need to be made for this.  All output connectors would need to be modified to do this, and all repository connectors as well.
(2) Since the security identifier would be valid within the context of a given repository connection, the "authority service" code that tries to verify visibility of a document given the authenticated user name and security identifier would need to look up the correct repository connection and call a method within it - which currently doesn't exist.  So we'd need to write such a method for all connectors that have security.
(3) Since this service would have a high load, and only be used under one particular model, I'd suggest actually defining a whole new webapp for it, so it can be distributed/controlled independently.

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 29, 2010 5:35 AM
To: connectors-user@incubator.apache.org
Cc: dev@lucene.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I guess it comes down to - any solution is ultimately going to place access control on a search and not on data, so there isn't much to be gained by binding the access control to the data. Whatever attributes exist at index time to build an acl will still be there at query time, so by making the acl search-bound, the acl is decoupled from the data, allowing it to be used in any use case scenario.

Here's a typical sampling of use cases where the decoupling of acl from data is required:

One customer has a  'shop-search' requirement where, logged-in users' access to various shops changes daily, sometimes 4 or 5 times a day. There are several hundred such shops and 10s of millions of documents, and the indexing part doesn't have ownership of any of the 'source' documents.

Another example is a customer who has multiple sites and multiple AD domains. They have one domain for the UK, but a completely separate domain for Gibraltar. When data is replicated to  remote servers accessed by Gibraltar staff, these users have no SID information in the other domain.

An 'interesting' example of this at the extreme is 34rkl4ys Bank, where, due to departmental history, they have no fewer than 85 AD domains! This of course is a nightmare in itself, but trying to tie access information to data at storage time is virtually impossible in this environment.

The thing I'm trying to understand is that the decoupled approach works equally well for the requirements where you do have acl information at index time. I guess I'm not understanding the advantages to making schema changes and binding acl to data, when there's really no need. I particularly like your idea of using LCF as the facilitator of storing/retrieving such decoupled data (as opposed to just an xml file). It sounds like there's even a user interface for 'non-technical' staff to make acl configuration changes. That's really cool, and ultimately an elegant solution that will fit present and future needs.


Kind regards,
Peter


On Thu, Apr 29, 2010 at 1:24 AM, <ka...@nokia.com> wrote:
Hi Peter,

I'm more than happy to hear your customer's requirements, so no problem there.  It does seem to me that they are a bit different than what I've seen.  I think there is plenty of room for different flavors of solution, so please by all means go ahead and propose your take on it!

Karl

________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 8:07 PM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I wasn't trying to put paid to your design proposal, really the opposite - to highlight requirements that have been found to be necessary for customers/users, and to hopefully get the best functionality for the product. If you feel I've put you out on any of the issues raised, then I apologize for that, it was certainly not my intention.

Peter



Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

I guess it comes down to - any solution is ultimately going to place access
control on a search and not on data, so there isn't much to be gained by
binding the access control to the data. Whatever attributes exist at index
time to build an acl will still be there at query time, so by making the acl
search-bound, the acl is decoupled from the data, allowing it to be used in
any use case scenario.

Here's a typical sampling of use cases where the decoupling of acl from data
is required:

One customer has a 'shop-search' requirement where logged-in users' access
to various shops changes daily, sometimes 4 or 5 times a day. There are
several hundred such shops and tens of millions of documents, and the
indexing part doesn't have ownership of any of the 'source' documents.

Another example is a customer who has multiple sites and multiple AD
domains. They have one domain for the UK, but a completely separate domain
for Gibraltar. When data is replicated to remote servers accessed by
Gibraltar staff, these users have no SID information in the other domain.

An 'interesting' example of this at the extreme is 34rkl4ys Bank, where, due
to departmental history, they have no fewer than 85 AD domains! This of
course is a nightmare in itself, but trying to tie access information to
data at storage time is virtually impossible in this environment.

The thing I'm trying to understand is that the decoupled approach works
equally well for the requirements where you do have acl information at index
time. I guess I'm not understanding the advantages to making schema changes
and binding acl to data, when there's really no need. I particularly like
your idea of using LCF as the facilitator of storing/retrieving such
decoupled data (as opposed to just an xml file). It sounds like there's even
a user interface for 'non-technical' staff to make acl configuration
changes. That's really cool, and ultimately an elegant solution that will
fit present and future needs.
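
A minimal sketch of the search-bound idea above, using the 'shop-search'
case: the index holds no access data, and a separate, mutable store decides
which filter is attached to each user's query. The names AclStore, shop_id
and search_params are hypothetical and are not part of Solr or LCF; the only
Solr feature assumed is the standard fq (filter query) request parameter.

```python
class AclStore:
    """Hypothetical in-memory ACL store kept outside the index."""

    def __init__(self):
        self._shops_by_user = {}

    def grant(self, user, shop):
        self._shops_by_user.setdefault(user, set()).add(shop)

    def revoke(self, user, shop):
        self._shops_by_user.get(user, set()).discard(shop)

    def filter_query(self, user, field="shop_id"):
        """Return an fq string restricting results to the user's shops."""
        shops = sorted(self._shops_by_user.get(user, ()))
        if not shops:
            # No grants: a filter intended to match nothing, not everything.
            return "-*:*"
        quoted = ['"%s"' % s.replace("\\", "\\\\").replace('"', '\\"')
                  for s in shops]
        return "%s:(%s)" % (field, " OR ".join(quoted))


def search_params(store, user, user_query):
    """Build request parameters; the user's own query is left untouched."""
    return {"q": user_query, "fq": store.filter_query(user)}
```

Under these assumptions, a change in a user's shop access is a grant/revoke
call on the store and takes effect on the next search, with no document
being re-indexed.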


Kind regards,
Peter


On Thu, Apr 29, 2010 at 1:24 AM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> I'm more than happy to hear your customer's requirements, so no problem
> there.  It does seem to me that they are a bit different than what I've
> seen.  I think there is plenty of room for different flavors of solution, so
> please by all means go ahead and propose your take on it!
>
> Karl
>
> ________________________________________
> From: ext Peter Sturge [peter.sturge@googlemail.com]
> Sent: Wednesday, April 28, 2010 8:07 PM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org;
> connectors-dev@incubator.apache.org; lucene-dev@apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> I wasn't trying to put paid to your design proposal, really the opposite
> - to highlight requirements that have been found to be necessary for
> customers/users, and to hopefully get the best functionality for the
> product. If you feel I've put you out on any of the issues raised, then I
> apologize for that, it was certainly not my intention.
>
> Peter
>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

I'm more than happy to hear your customer's requirements, so no problem there.  It does seem to me that they are a bit different than what I've seen.  I think there is plenty of room for different flavors of solution, so please by all means go ahead and propose your take on it!

Karl

________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 8:07 PM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

I wasn't trying to put paid to your design proposal, really the opposite - to highlight requirements that have been found to be necessary for customers/users, and to hopefully get the best functionality for the product. If you feel I've put you out on any of the issues raised, then I apologize for that, it was certainly not my intention.

Peter


Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

I wasn't trying to put paid to your design proposal, really the opposite -
to highlight requirements that have been found to be necessary for
customers/users, and to hopefully get the best functionality for the
product. If you feel I've put you out on any of the issues raised, then I
apologize for that, it was certainly not my intention.

Peter

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Yes, I don't doubt that using an external mechanism such as AD lockout will
work for those and other environments. I guess it comes down to the
difference between bespoke consultancy-type solutions and general-purpose
product solutions, of which the requirements are often very different. For a
general Access Control solution integrated into Solr, assumptions about the
presence/type of such external controls can't, and shouldn't, be made. If
they are/must be made, one of the core reasons for adding the new
functionality is missing.

As a starting point, for a general purpose access control system, at least
the following questions need to be addressed:
   * What happens when access control needs to change?
   * What happens if access control needs to change often (e.g. more than
     several times a day)?
   * Can the access control cope with multiple data source types, without
     the need for custom code, including data with no attached acl information?
   * If I change my access control, how is 'offline' data affected? (e.g.
     backed-up data)
   * Will the access control satisfy regulatory compliance specs on its own,
     or is an external mechanism required?
     (currently, Solr requires an external mechanism, but so does the
     proposed solution)

As you might have guessed, I've been down this road before, and the
productization of security control has many facets, and these, as a general
rule, need to be addressed differently in products than in site-specific
deployments - mainly because products can't assume the environment(s) they
will run in (e.g. Active Directory).

The good thing is, there is a good alternative - that is: to store access
control information separately from indexed data and separately from an
authority. To me, that's where the beauty of an LCF plugin architecture
lives. Then, the task is to provide the integration tools (and it sounds
like LCF is very well suited for this) to deliver the 'bridge' between
content and authorization. (as you quite rightly said, authentication is a
separate, albeit related, subject)
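
One way to picture the 'bridge' between content and authorization, as a
sketch only: the authority answers "which groups is this user in?", a
separate rule store answers "which filters apply to those groups?", and the
index itself carries no access data. resolve_groups, build_filters, the
rule layout and the field names are all hypothetical; none of them is an
LCF or Solr API.

```python
def resolve_groups(user, directory):
    """Stand-in for an authority lookup (e.g. AD); returns the user's groups."""
    return directory.get(user, [])


def build_filters(groups, rules):
    """Combine per-group rules into a list of filter queries.

    Each group may contribute 'allow' and 'deny' clauses. Allow clauses are
    OR-ed into one filter; every deny clause becomes its own negative filter.
    """
    allow, deny = [], []
    for group in groups:
        rule = rules.get(group, {})
        allow.extend(rule.get("allow", []))
        deny.extend(rule.get("deny", []))

    if allow:
        filters = ["(" + " OR ".join(allow) + ")"]
    else:
        # Nothing allowed: a filter intended to match no documents.
        filters = ["-*:*"]
    filters.extend("-(" + clause + ")" for clause in deny)
    return filters
```

Each returned string would be sent as a separate fq parameter. In this
arrangement a lockout in the authority, or an edit to the rule store,
changes the filters on the next query without touching indexed documents.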

Thanks,
Peter




On Wed, Apr 28, 2010 at 12:46 PM, <ka...@nokia.com> wrote:

>  >>>>>>
> With regards schema extension, I believe we need to be very careful here,
> as requiring index-time storage of access control data will pose a problem
> for any use cases where the access control needs to change (maybe often,
> maybe only occasionally). I'm trying to think of a use case where this
> wouldn't at least potentially be the case, and I can't think of one, but
> perhaps I'm not truly understanding what exactly is stored in the
> __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl
> changes would fit in (e.g. let's say someone has left my organization, do I
> have to update documents to remove his/her access?).
> <<<<<<
>
> Usually the way this works is that the user's account is locked out so they
> can't log in.  The authority service picks up this change, and it therefore
> takes place immediately.
>
> Bear in mind that this particular model has been employed by MetaCarta for
> more than five years in the field with clients such as pretty near all the
> major oil companies, many U.S. government agencies, the U.S. military, etc.
> In that time we have not heard even one complaint about the security model.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Wednesday, April 28, 2010 7:18 AM
>
> *To:* dev@lucene.apache.org
> *Cc:* connectors-dev@incubator.apache.org;
> connectors-user@incubator.apache.org; lucene-dev@apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Apologies for the delayed reply. I've been away on business, and in the
> middle of a product release, so it's been a busy time...
>
> In response to your earlier questions:
>
> The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses,
> although the point at which these are done is slightly different.
>
> I think I am correct in my understanding that with filter queries, the
> results are filtered 'post-Lucene', but are separately (Solr) cached, so you
> get a hit on the first search, but then benefit from cached hits on
> subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are
> applied at the Lucene query directly, so don't have separate Solr caching.
> I've not benchmarked the two, so one or other might be slower/faster for
> various search scenarios.
>
> In any case, I believe either technique can be employed in either 1834 or
> 1872.
>
>
> With regards schema extension, I believe we need to be very careful here,
> as requiring index-time storage of access control data will pose a problem
> for any use cases where the access control needs to change (maybe often,
> maybe only occasionally). I'm trying to think of a use case where this
> wouldn't at least potentially be the case, and I can't think of one, but
> perhaps I'm not truly understanding what exactly is stored in the
> __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl
> changes would fit in (e.g. let's say someone has left my organization, do I
> have to update documents to remove his/her access?).
>
> Also, would such indexed tokens be entirely 'document-context-free'? I.e.
> Would the same type/format of tokens be used for data from different sources
> (e.g. NTFS files, network streams, NFS, web pages, etc.). Will the tokens be
> compatible with multiple and/or changing authorities (e.g. AD, documentum,
> LDAP, custom, etc.)?
>
> I like the idea of an LCF plugin to hold the acl data. I admit, I've not
> had enough time to look into how this might look at the moment, but it
> sounds like it could be a good way to hold generic (authority-agnostic) acl
> data, and [hopefully] not have to tie it to document data at index-time.
>
> I hope this makes sense, but if I've misunderstood the proposed mechanism,
> please correct me. Would the __ALLOW_TOKEN__ et al fields store, for
> example, SID information?
>
>
> Thanks,
> Peter
>
>
>
> On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:
>
>> Ok, not hearing back from Peter, I've done some Solr research and written
>> some code that might work.  The approach I've taken is most similar to SOLR
>> 1834, other than the LCF-centric logic.  Hopefully there will be a chance to
>> try this out in a full end-to-end way  on the weekend, after which I will
>> submit it to the Solr team (where I think it most naturally would be built
>> and delivered).
>>
>> What it's going to need is either a static or dynamic schema addition to
>> define __ALLOW_TOKEN__document, __DENY_TOKEN__document,
>> __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be
>> string, multivalued fields (I think).  It would be great if these could be
>> made a default part of Solr; similarly, it would be good if the new search
>> component was predelivered with Solr and mentioned (even if commented out)
>> in the example solrconfig.xml file.  The only other thing that needs to be
>> done to hook up the search component is to include a configuration parameter
>> describing the base URL of the LCF authority service.  Plus, as I said
>> earlier, we still don't have a canned solution for authentication yet -
>> although I feel that will be straightforward.
>>
>> Comments welcome...
>> Karl
>>
>>
>> ________________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 27, 2010 8:20 AM
>> To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
>> Subject: RE: FW: Solr and LCF security at query time
>>
>> Hi Peter,
>>
>> I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions
>> in detail, and have a couple of SOLR-related questions.
>>
>> Both contributions rely on a SearchComponent to work their magic.
>>  However, it also appears that each modifies the user query in a different
>> way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses
>> standard AND and OR filterquery clauses.  Both of them are constructed using
>> Solr FilterQuery objects.  Here are my questions:
>>
>> (1) I am not conversant enough with Solr yet to know the difference
>> between the different kinds of clause structure.  Do you know if there is a
>> difference?  For example, is there any possibility that AND/OR clauses can
>> permit documents to be seen that should not be seen?  (MUST and MUST_NOT
>> sound a lot more definite...)
>>
>> (2) Are Solr FilterQuery objects applied to constructing the query that
>> will be sent to Lucene?  Or are they applied by Solr after-the-fact to the
>> resultset?  Or, is it a combination of the two, depending on the details of
>> your actual filter clause?
>>
>> I also haven't heard much from you in the last week or so - have you
>> thought further about what you intend to do, and can you let me know whether
>> you are still interested in developing an LCF plugin for Solr?
>>
>> Thanks,
>> Karl
>>
>> -----Original Message-----
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> Sent: Thursday, April 22, 2010 12:23 PM
>> To: dev@lucene.apache.org
>> Cc: connectors-dev@incubator.apache.org;
>> connectors-user@incubator.apache.org; lucene-dev@apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> See inline...
>>
>> On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:
>>
>> > Hi Peter,
>> >
>> > The authority connectors don't perform authentication at this time.
>> > In fact, LCF has nothing to do with authentication at all - just
>> authorization.
>> >  The reason for this is because it is almost never the case that
>> > somebody wants to provide multiple credentials in order to be able see
>> their results.
>> >  Most enterprises who have multiple repositories authenticate against
>> > AD and then map AD user names to repository user names in order to
>> > access those repositories.  If you noted my earlier posts from this
>> > morning, you may have noted that I'm looking at recommending JAAS plus
>> > sun's kerb5 login module for handling the "authenticate against AD"
>> > case, which would cover some 95%+ of the real world authentication
>> needed out there.
>> >
>> >
>> I did read your earlier post regarding this, and I totally agree with you
>> - this is best handled 'upstream'. In fact, I use a JAAS plugin in other
>> places in the product (not Solr) for authentication.
>>
>>
>> >
>> > Yes, the idea is to store SIDs in solr at index time.  I don't know
>> > enough about solr to know what kinds of issues this might entail, but
>> > Lucene certainly has a model of metadata that's pretty flexible, so I
>> > don't think this would be difficult at all.  Eric Hatcher also seemed
>> > to confirm my suspicions that this would not be a problem.
>> >
>>
>> It's certainly not a problem to store this data in Solr. The problem is
>> more that you don't really *want* to store this data at index time.
>> There are lots of reasons for not wanting to 'hard-code' SID data with
>> documents in the index. Here's just a few:
>>  * What happens if/when you want to add explicit user access to some
>> [group of] documents ? (i.e. not via a group)
>>  * What happens if you need to revoke or change a user's or group's
>> access?
>>  * It's difficult to move/replicate the index to another domain
>>  * For AD, SIDs are generally not meant to be stored long term outside of
>> AD, as they can be changed (this doesn't happen often, but it can happen
>> after an AD rebuild, domain type upgrade, data recovery etc.)
>>
>> These and other scenarios mean re-indexing the stored data. When the index
>> is huge, this is non-trivial (time-wise). There are not uncommon scenarios
>> where user/group access control can change multiple times in one day.
>>
>> There might be a way of storing acl data in a payload or similar, but I'm
>> not sure how that would work across millions of [arbitrarily grouped]
>> documents (I'm not familiar enough with payloads to know if this would be a
>> good or bad idea).
>>
>>
>> >
>> > This is exactly why I think that we need to do the authentication
>> > upstream of the authority world.
>> >
>> >
>> Agreed.
>>
>>
>>
>> >
>> > If Solr handles arbitrary document metadata, then I think we could
>> > just use that feature.  But you know more about it than me, at this
>> > point.  It would be great to get an overview of potential ways of doing
>> this.
>> >
>> >
>> Payloads, maybe?
>>
>>
>> >
>> > For your particular task, it sounds like you are trying to read from
>> > NTFS and apply security after-the-fact with some acl specification
>> > file.  In that case, I'd write a repository connector that was based
>> > on the file system connector (already part of the stable of connectors
>> > for LCF) which reads ACL information from your acl.xml file.  Or, if
>> > you prefer a UI for specifying ACL information, you could extend the
>> > connector so that security is configured in the UI without having an
>> > external acl.xml file at all - which would be a nice addition to the
>> > existing file system connector.  (Repository connections and jobs are
>> > configured internally in LCF by XML documents stored in the database,
>> > so they can be arbitrarily structured.  I'm happy to help you figure
>> > out how to do this if this is what you decide to do.)
>> >
>> For my particular requirements, there are no files - the data is generated
>> from the network and stored. After the fact, there is no persistent
>> location of this data other than in Solr.
>>
>> Storing the acl info using the connector sounds very interesting. Could be
>> worth looking at in more detail. Thanks!
>>
>>
>>
>> > I think we still need to add in the authentication piece to make this
>> > all work for you, so perhaps you can describe how you expect a user to
>> > interact with your system, so I can understand your design issues.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>>
>>
>>
>>
>>
>>
>>
>>
>> > -----Original Message-----
>> > From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> > Sent: Thursday, April 22, 2010 11:32 AM
>> > To: dev@lucene.apache.org
>> > Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> > connectors-dev@incubator.apache.org
>> > Subject: Re: FW: Solr and LCF security at query time
>> >
>> > Hi Karl,
>> >
>> > Thanks very much for your detailed explanation - really good!
>> >
>> > As I've thought through some of the implications, I've added comments
>> > below, so I hope they don't seem too jumbled...
>> >
>> > I suppose on the 'authority' side, it works kind of as I envisioned it
>> > would.
>> >
>> > For general Solr access control, there's two layers of security that
>> > need to be addressed:
>> >  1. Authentication - make sure the incoming query is from a valid
>> > user, and the passed-in credentials (hash, certificate etc.) are
>> > correct
>> >  2. Query filtering - potentially reduce the number/type of
>> > returned results based on the allow/deny metadata for the
>> > authenticated user
>> >
>> > I can see how the LCF auth connector works for 2., but can it do 1. as
>> > well?
>> > It would be good if this could somehow be integrated into any
>> > container (Tomcat/Jetty et al) authentication that might be configured
>> > (probably related to your previous post). In many ways, it could/should
>> > be that the Authority (AD) part of the connector should only be
>> > concerned with 1. and not 2. (see below).
>> >
>> > So, on the repository side, there is also an LCF connector that
>> > 'closes the loop' to provide the 'what is it I'm trying to control' side
>> of things.
>> > I understand that LCF doesn't do the mapping - it delegates this task
>> > to the caller, but provides both sides of the equation (authority &
>> > repository).
>> >
>> > >>>>>
>> > - Each file in DirectoryA will have the following
>> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890",
>> and "myAD:S-23-64-12345".
>> > - Each file in DirectoryB will have the following
>> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
>> > <<<<<
>> > I think this is the bit that is worrying me - is this storing the SIDs
>> > into Solr at document index time? This would be a problem for a whole
>> > load of reasons, but maybe I'm missing something here? (see below for
>> > a possible
>> > alternative)
>> >
>> > Basically, what I'm getting at here is that the allow/deny values need
>> > to be stored in one of three places:
>> >  1. In the authority (e.g. inside AD)
>> >  2. In the document metadata (index-time)
>> >  3. In external storage (e.g. acl.xml, NTFS etc.)
>> >
>> > 1. Extending AD is pretty much out, as this causes too many interop
>> > problems
>> > 2. 'Hard-coding' acl information in the index makes it
>> > non-portable, resistant to changes, etc.
>> > 3. acl.xml is coupled with a Solr instance, but is easily
>> > ported/replicated.
>> > Storing/retrieving acl information from the source (e.g. NTFS) is
>> > problematic, as the source may not be accessible (it may not even
>> exist).
>> >
>> > I believe 3. or a variant is the way to go on the repo side, which
>> > means the LCF Authority connector is mainly for Authentication (see
>> > above), which is what you want from AD et al integration.
>> > The problem that arises from 'pluggable' authentication is that, if
>> > you're not using a certificate, you have to start with a password, but
>> > the connector only has access to the password hash (unless the pwd is
>> > sent in the query url). I don't know of a way to confirm identities in
>> > AD using only the username and hash (AD does the hash compare). I
>> > believe this is where container-based integration will likely work
>> better.
>> >
>> > So that I can confirm my understanding...a scenario might be like this:
>> >
>> > We have an AD connector that fetches the SIDs and we can read them etc.
>> > For my environment, where there are no 'files' (there's only a
>> > transient network stream), we have an LCF 'Solr Field Filter Query'
>> > connector that decides which Filter Queries to apply (allow and deny)
>> > for the passed in SID(s).
>> >
>> > For another environment, let's say, NTFS, there might be an 'NTFS'
>> > connector that would provide some kind of mapping of files/folders to
>> > SID(s). Since Solr wouldn't intrinsically know about this, the acl
>> > information would need to be stored somewhere in the index. This would
>> > mean extending the Solr schema and storing metadata at index time.
>> > The alternative is to re-use the 'Solr Field Filter Query' connector
>> > for this as well (and any other document types that might be read in).
>> > This keeps the index 'clean' of acl-specific metadata, and allows for
>> > in-place changes and easy cross-document/index/instance access control.
>> >
>> >
>> > If the above interpretation is [roughly] correct (please let me know
>> > if I've got this wrong!), this would reduce down to having:
>> >   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
>> > (possibly/partly at the container level)
>> >   2. At least an LCF Repository connector for 'acl.xml'
>> >   3. Optional other LCF Repository connectors
>> >
>> > It sounds like you've now finished the first half of 1. by adding the
>> > ability to get the required auth data from a Solr api call. The other
>> > half of 1. will be implementing the LCF interface in the
>> > SolrACLSecurity class, to effectively replace the 'user', 'group' and
>> 'password' bits of acl.xml.
>> >
>> > Does the above sound like an accurate interpretation? Just trying to
>> > get a good picture of what work needs doing, where it goes, etc.
>> >
>> > Many thanks!
>> > Peter
>> >
>> >
>> >
>> >
>> > On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>> >
>> > >  >>>>>>
>> > > What is the relationship between stored data (documents) and
>> authorities'
>> > > access/deny attributes? (do you have any examples of what an
>> > > access_token value might contain?) <<<<<<
>> > >
>> > > Documents have access/deny attributes; authorities simply provide
>> > > the list of tokens that belong to an authenticated user.  Thus,
>> > > there's no access/deny for an authority; that's attached to the
>> > > document (as it is in real-world repositories).
>> > >
>> > > Let's run a quick example, using Active Directory and a Windows file
>> > > system.  Suppose that you have a directory with documents in it,
>> > > call it DirectoryA, and the directory allows read access to the
>> > > following
>> > SIDs:
>> > >
>> > > S-123-456-76890
>> > > S-23-64-12345
>> > >
>> > > These SIDs correspond to active directory groups, let's call them
>> > > Group1 and Group2, respectively.
>> > >
>> > > DirectoryB also has documents in it, and those documents have just
>> > > the SID S-123-456-76890 attached, because only Group1 can read its
>> contents.
>> > >
>> > > Now, pretend that someone has created an LCF Active Directory
>> > > authority connection (in the LCF UI), which is called "myAD", and
>> > > this connection is set up to talk to the governing AD domain
>> > > controller for this Windows file system.  We now know enough to
>> > > describe the document
>> > indexing process:
>> > >
>> > > - Each file in DirectoryA will have the following
>> > > __ALLOW_TOKEN__document attributes inside Solr:
>> > > "myAD:S-123-456-76890",
>> > and "myAD:S-23-64-12345".
>> > > - Each file in DirectoryB will have the following
>> > > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
>> > >
>> > > Now, suppose that a user (let's call him "Peter") is authenticated
>> > > with the AD domain controller.  Peter belongs to Group2, so his SIDs
>> > > are
>> > (say):
>> > >
>> > > S-1-1-0 (the 'everyone' SID)
>> > > S-323-999-12345 (his own personal user SID)
>> > > S-23-64-12345 (the SID he gets because he belongs to group 2)
>> > >
>> > > We want to look up the documents in the search index that he can see.
>> > > So, we ask the LCF authority service what his tokens are, and we get
>> > back:
>> > >
>> > > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
>> > >
>> > > The documents we should return in his search are the ones matching
>> > > his search criteria, PLUS the intersection of his tokens with the
>> > > document ALLOW tokens, MINUS the intersection of his tokens with the
>> > > document DENY tokens (there aren't any involved in this example).
>> > > So only files that have one of his three tokens as an ALLOW
>> > > attribute would be
>> > returned.
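
The matching rule in the paragraph above can be sketched in a few lines of Python. This is only an illustration, assuming tokens are plain strings and that a DENY match always overrides an ALLOW match; the function name `is_visible` is hypothetical and not part of LCF or Solr.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens):
    """Return True if a user holding user_tokens may see a document.

    Assumption: a document is visible when at least one user token
    appears among its ALLOW tokens and none appears among its DENY tokens.
    """
    user = set(user_tokens)
    if user & set(deny_tokens):
        # Any overlap with DENY excludes the document outright.
        return False
    # Otherwise at least one ALLOW token must match.
    return bool(user & set(allow_tokens))


# Tokens from the example in this message.
PETER = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]
DIRECTORY_A = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
DIRECTORY_B = ["myAD:S-123-456-76890"]
```

With these values, `is_visible(PETER, DIRECTORY_A, [])` is true because Peter holds the Group2 token, while `is_visible(PETER, DIRECTORY_B, [])` is false because none of his tokens is attached to DirectoryB.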
>> > >
>> > > Note that what we are attempting to do is enforce AD's security with
>> > > the search results we present.  There is no need to define a whole
>> > > new security mechanism, because AD already has one that people use.
>> > >
>> > > >>>>>>
>> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
>> > > to ensure there are no security or other dependencies of indexed
>> > > data with any external repository - most notably the file system.
>> > > There are many reasons for wanting this, but one of the main ones is
>> > > that Solr-stored data is not always based on file data (or
>> > > accessible
>> > file data).
>> > > In fact, in my particular case, almost none of the indexed data
>> > > comes from files.
>> > > <<<<<<
>> > >
>> > > LCF is all about abstracting from repositories.  It's not
>> > > specifically about a file system, although that is a convenient
>> > > example.  If you are building your own kind of repository with your
>> > > own security setup, that's fine - but in the LCF world you'd need to
>> > > create an authority connector for your repository (which maybe reads
>> > > your acl.xml file), as well as a repository connector (which hands
>> > > documents to LCF and provides it with the access tokens that make
>> > > security work).  Of course, you can do something much lighter that
>> > > doesn't include LCF at all if you are just integrating a custom
>> > > repository of your own, but it sounded like you were interested in the
>> broader problem here.
>> > >
>> > > So, LCF doesn't do "acl mapping" at all.  It relies on its various
>> > > connectors to work cooperatively to define access tokens in a way
>> > > that is consistent from authority connector to repository connector
>> > > for a given repository kind.  Anybody can write a connector, so the
>> > > beauty of all this is that you can build a system where data from
>> > > many disparate sources is indexed, and security for each is
>> > > simultaneously
>> > enforced.
>> > >
>> > > Karl
>> > >
>> > >
>> > >  ------------------------------
>> > > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> > > *Sent:* Thursday, April 22, 2010 9:24 AM
>> > >
>> > > *To:* dev@lucene.apache.org
>> > > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> > > connectors-dev@incubator.apache.org
>> > > *Subject:* Re: FW: Solr and LCF security at query time
>> > >
>> > > Hi Karl,
>> > >
>> > > Thanks very much for the diagram -
>> > > Sorry about all the questions, but this raises a few new ones...
>> > >
>> > > What is the relationship between stored data (documents) and
>> authorities'
>> > > access/deny attributes? (do you have any examples of what an
>> > > access_token value might contain?)
>> > >
>> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
>> > > to ensure there are no security or other dependencies of indexed
>> > > data with any external repository - most notably the file system.
>> > > There are many reasons for wanting this, but one of the main ones is
>> > > that Solr-stored data is not always based on file data (or
>> > > accessible
>> > file data).
>> > > In fact, in my particular case, almost none of the indexed data
>> > > comes from files.
>> > >
>> > > This is one reason why SOLR-1872 uses filter queries for its
>> > > access/deny tokens - so that all the required information for access
>> > > control completely resides within the Solr index itself.
>> > > Is the LCF architecture acl 'mapping' between Solr fields (queries)
>> > > and users, some external 'repository' (files) and users, or
>> > > arbitrary
>> > data (e.g.
>> > > either of these)?
>> > >
>> > > I hope that makes sense...
>> > >
>> > > Thanks!
>> > > Peter
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
>> > >
>> > >> Hi Peter,
>> > >>
>> > >> I've attached a diagram that is not in the wiki as of yet, and I'll
>> > >> try to answer your questions.
>> > >>
>> > >> >>>>>>
>> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
>> > >> stored for a particular user in the underlying acl store (e.g.
>> > >> Active
>> > Directory)?
>> > >> How does AD and/or LCF handle storing such data in its schema?
>> > >> (does AD need its schema extended?) Presumably, any such AD fields
>> > >> would need to be queried for effective rights in order to cater for
>> > >> group membership allows and denies.
>> > >> <<<<<<
>> > >>
>> > >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
>> > >> strings that represent a contract between an LCF authority
>> > >> connection and the LCF repository connection that picks up the
>> > >> documents (from
>> > wherever).
>> > >>  These tokens thus have no real meaning outside of LCF.  You must
>> > >> regard them as opaque.
>> > >>
>> > >> The contract, however, states that if you use the LCF authority
>> > >> service to obtain tokens for an authenticated user, you will get
>> > >> back a set that is CONSISTENT with the tokens that were attached to
>> > >> the documents LCF sent to Solr for indexing in the first place.
>> > >> So, you don't have to worry about it, and that's kind of the idea.
>> > >> So you
>> > imagine the following flow:
>> > >>
>> > >> (1) Use LCF to fetch documents and send them to Solr
>> > >> (2) When searching, use the LCF authority service to get the
>> > >> desired user's access tokens
>> > >> (3) Either filter the results, or modify the query, to be sure the
>> > >> access tokens all match up properly
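
Step (3), in its "modify the query" form, might look like the following Python sketch. It is an assumption-laden illustration: the helper names are hypothetical, the default field names are the ones mentioned elsewhere in this thread (which also uses `__ACCESS_TOKEN__document` for the allow side), and the output is meant to follow ordinary Lucene query syntax with each token quoted as a phrase.

```python
def quote_token(token):
    """Quote a token as a phrase, escaping backslashes and double quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_filter_query(user_tokens,
                       allow_field="__ALLOW_TOKEN__document",
                       deny_field="__DENY_TOKEN__document"):
    """Build one filter clause: require an ALLOW match, forbid a DENY match.

    The field names are assumptions taken from this thread and should be
    adjusted to whatever the schema actually defines.
    """
    if not user_tokens:
        # With no tokens nothing can be allowed; the caller should
        # return an empty result rather than run the query.
        raise ValueError("user has no access tokens")
    any_token = " OR ".join(quote_token(t) for t in user_tokens)
    return "+{0}:({2}) -{1}:({2})".format(allow_field, deny_field, any_token)
```

The resulting string could be passed as an additional filter alongside the user's own query, so that restricted documents never enter the result set and no post-filtering is needed.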
>> > >>
>> > >> For the AD authority, the LCF access tokens consist, in part, of
>> > >> the user's SIDs.  For other authorities, the access tokens are
>> > >> wildly
>> > different.
>> > >>  You really don't want to know what's in them, since that's the job
>> > >> of the LCF authority to determine. ;-)
>> > >>
>> > >> LCF is not, by the way, joined at the hip with AD.  However, in
>> > >> practice, most enterprises in the world use some form of AD single
>> > >> signon for their web applications, and even if they're using some
>> > >> repository with its own idea of security, there's a mapping between
>> > >> the AD users and the repository's users.  Doing that mapping is
>> > >> also the job of the LCF authority for that repository.
>> > >>
>> > >> Hope this helps.  Also, I'm not expecting time miracles here, so
>> > >> don't sweat the schedule.
>> > >>
>> > >>
>> > >> Karl
>> > >>
>> > >>
>> > >> ________________________________________
>> > >> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> > >> Sent: Thursday, April 22, 2010 4:27 AM
>> > >> To: dev@lucene.apache.org
>> > >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> > >> connectors-dev@incubator.apache.org
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> Hi Karl,
>> > >>
>> > >> Thanks for the quick turnaround.
>> > >> I'm in the middle of a product release for us, so I fear I won't be
>> > >> as quick as you... :-)
>> > >>
>> > >> I couldn't find a simple flow diagram or similar for LCF with
>> > >> regards security (probably looking in the wrong place).
>> > >> Perhaps you could help on these questions...?
>> > >>
>> > >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
>> > >> sub-queries, which are then used as filter queries in a user's
>> search.
>> > >>
>> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
>> > >> stored for a particular user in the underlying acl store (e.g.
>> > >> Active
>> > Directory)?
>> > >> How does AD and/or LCF handle storing such data in its schema?
>> > >> (does AD need its schema extended?) Presumably, any such AD fields
>> > >> would need to be queried for effective rights in order to cater for
>> > >> group membership allows and denies.
>> > >>
>> > >> I guess I'm just trying to understand the architectural
>> > >> flow/storage/retrieval of data in the various parts of the system,
>> > >> but I admit, I need to do more research on this.
>> > >> After our product release, when I get a few more spare cycles, I
>> > >> can look at it in more detail.
>> > >>
>> > >> Many thanks!
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> Hi Peter,
>> > >>
>> > >> I just committed the promised changes to the LCF Solr output
>> connector.
>> > >>
>> > >> ACL metadata will now be posted to the Solr Http interface along
>> > >> with the document as the two following fields:
>> > >>
>> > >> __ACCESS_TOKEN__document
>> > >> __DENY_TOKEN__document
>> > >>
>> > >> There will, of course, potentially be multiple values for each of
>> > >> these two fields.
>> > >>
>> > >> Hope this helps,
>> > >> Karl
>> > >>
>> > >> ________________________________
>> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> > >> peter.sturge@googlemail.com>]
>> > >> Sent: Tuesday, April 20, 2010 6:51 PM
>> > >>
>> > >> To: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> Hi Karl,
>> > >>
>> > >> Thanks for the info. I'll have a look at the link and try to take
>> > >> in as much sugar as my insulin levels will handle...
>> > >> It sounds like the necessary interface(s) are already in LCF - just
>> > >> a matter of implementing them in the Solr 1872 plugin.
>> > >> I'll need to digest the LCF stuff to get to grips with it..please
>> > >> bear with me while I do that...
>> > >>
>> > >> When you say:
>> > >>   The LCF solr output connection doesn't yet do this, but it is
>> > >> trivial for me to make that happen.
>> > >> Do you mean a mechanism by which solr.war can get url et al info
>> > >> from its parent container (Tomcat, Jetty etc.), or have I
>> > >> misinterpreted
>> > this?
>> > >>
>> > >>
>> > >> Thanks,
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> Hi Peter,
>> > >>
>> > >> I'm the principal committer for LCF, but I don't know as much about
>> > >> Solr as I ought to, so it sounds like a potentially productive
>> > collaboration.
>> > >>
>> > >> LCF does exactly what you are looking for - the only issue at all
>> > >> is that you need to fetch a URL from a webapp to get what you are
>> > >> looking for.  The "plugs" are all inside LCF for different kinds of
>> > >> repositories.  Here's a link that might help with drinking the LCF
>> > "koolaid", as it were:
>> > >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>> > >>
>> > >> The url would be something like this (on a locally installed
>> > >> tomcat-based LCF instance):
>> > >>
>> > >>
>> > >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>> > >>
>> > >> ... and this fetch returns something like:
>> > >>
>> > >> TOKEN:xxxxxxx
>> > >> TOKEN:yyyyyyy
>> > >> TOKEN:zzzzzzz
>> > >> ....
>> > >>
>> > >> ... which represent the amalgamated tokens for all of the defined
>> > >> authorities, and by some strange coincidence ( ;-) ) are compatible
>> > >> with certain pieces of metadata that have been passed into Solr
>> > >> with each document - one set of Allow tokens, and a second set of
>> > >> Deny tokens.  The LCF solr output connection doesn't yet do this,
>> > >> but it is trivial for me to make that happen.
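
A client consuming the response shown above could extract the tokens roughly as follows. This is a sketch only: it assumes the body is plain text with one `TOKEN:` entry per line, as in the sample, and that any other line can be ignored; the function name is hypothetical and the real service may return additional line types not covered here.

```python
def parse_authority_response(body):
    """Collect access tokens from a 'TOKEN:<value>' per-line response body.

    Only the leading 'TOKEN:' prefix is stripped, because the token value
    itself may contain colons (for example 'myAD:S-1-1-0').
    """
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens
```

Fetching the body itself (an HTTP GET against the authority service URL) is left out so that the sketch stays independent of any particular deployment.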
>> > >>
>> > >> Does this sound plausible to you?
>> > >>
>> > >> Karl
>> > >>
>> > >>
>> > >> ________________________________
>> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> > >> peter.sturge@googlemail.com>]
>> > >> Sent: Tuesday, April 20, 2010 5:41 PM
>> > >> To: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
>> > >> dev@lucene.apache.org>
>> > >>
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> Hi Karl,
>> > >>
>> > >> Integrating LCF to get external token support for SOLR-1872 sounds
>> > >> very interesting indeed. I don't know anything about LCF, but one
>> > >> of the things I was planning for SOLR-1872 is to make acl.xml (or
>> > >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
>> > >> series of plugins that could be used for obtaining back-end
>> > >> authentication
>> > information.
>> > >>
>> > >> If you're good with LCF, perhaps we could work together to build
>> > >> this
>> > in.
>> > >> One of the first things would be defining an interface that would
>> > >> be as easy as possible to plug LCF into. Have you any
>> > >> suggestions/insight on this front?
>> > >>
>> > >> Many thanks,
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> SOLR-1872 looks exactly like what I was envisioning, from the
>> > >> search query perspective, although instead of the acl xml file you
>> > >> specify LCF stipulates you would dynamically query the
>> > >> lcf-authority-service servlet for the access tokens themselves.
>> > >> That would get you support for AD, Documentum, LiveLink, Meridio,
>> > >> and Memex for free. It seems likely that this component could be
>> > >> modified to work with LCF with minor
>> > effort.
>> > >>
>> > >> The missing component still seems to be AD authentication, which
>> > >> needs a solution.
>> > >>
>> > >> Karl
>> > >>
>> > >> ________________________________
>> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> > >> peter.sturge@googlemail.com>]
>> > >> Sent: Tuesday, April 20, 2010 10:44 AM
>> > >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> If you want to do this completely within Solr, have a look at:
>> > >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>> > >>
>> > >> Thanks,
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> FYI
>> > >>
>> > >> ________________________________
>> > >> From: Wright Karl (Nokia-S/Cambridge)
>> > >> Sent: Tuesday, April 20, 2010 8:16 AM
>> > >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
>> > >> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
>> > >> connectors-dev@incubator.apache.org<mailto:
>> > >> connectors-dev@incubator.apache.org>'; '
>> > >> connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>'
>> > >> Subject: RE: Solr and LCF security at query time
>> > >>
>> > >> Dominique,
>> > >>
>> > >> Yes, I am aware of this ticket and contribution.  Luckily LCF
>> > >> establishes a powerful multi-repository security model, even though
>> > >> it doesn't yet do the final step of enforcing that model at the
>> > >> search end.  LCF allows you to define multiple authorities to
>> > >> operate against disparate repositories, and use the appropriate
>> > >> authority to secure any given document.  The solr people are aware
>> > >> of this design, which addresses the issues raised by SOLR-1834 very
>> > >> nicely.  However, as I said before, time is a problem, and the work
>> > >> still needs to be
>> > done.
>> > >>
>> > >> I suggest you read up on the actual security model of LCF, and
>> > >> perhaps experiment with that and the SOLR-1834 contribution, to see
>> > >> if there is common ground.  One thing we've learned at MetaCarta is
>> > >> that post-filtering for security purposes is expensive, and it is
>> > >> better to modify the queries themselves to restrict the results, if
>> > >> possible.  I'm not sure which approach SOLR-1834 takes, although it
>> > >> sounds like it might be the filtering approach.  Still, it would be
>> > better than nothing.
>> > >>
>> > >> Please let me know what you find out.
>> > >>
>> > >> Thanks,
>> > >> Karl
>> > >>
>> > >> ________________________________
>> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
>> > >> dominique.bejean@eolya.fr>]
>> > >> Sent: Tuesday, April 20, 2010 8:03 AM
>> > >> To: Wright Karl (Nokia-S/Cambridge)
>> > >> Cc: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>;
>> > >> connectors-dev@incubator.apache.org<mailto:
>> > >> connectors-dev@incubator.apache.org>
>> > >> Subject: Re: Solr and LCF security at query time
>> > >>
>> > >> Karl,
>> > >>
>> > >> Thank you for your reply.
>> > >>
>> > >> I did some research today and I found this:
>> > >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> > >> http://demo.findwise.se:8880/SolrSecurity/
>> > >>
>> > >> The Solr security model has to be able to filter a result list with
>> > >> items coming from various sources at the same time (livelink,
>> > >> documentum, file system, ...). Big subject :)
>> > >>
>> > >> Dominique
>> > >>
>> > >>
>> > >> On 20/04/10 13:34,
>> > >> karl.wright@nokia.com<ma...@nokia.com> wrote:
>> > >> Hi Dominique,
>> > >>
>> > >> At the moment, in order to enforce the LCF security model within
>> > >> Lucene/Solr, you will need to build this functionality into
>> > >> whatever client you are using to display the Lucene search results.
>> > >> Specifically, you would need to take the following steps:
>> > >>
>> > >> (1) Have your users access your search client through Apache.
>> > >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> > >> mod_authz_annotate, to cause authorization HTTP headers to be
>> > >> transmitted to the client webapp.
>> > >> (3) Have your client webapp alter whatever queries it is doing, to
>> > >> add an appropriate query clause for each of the access tokens
>> > >> transmitted in the headers.
>> > >>
>> > >> (This is how it is done at MetaCarta.)
>> > >>
>> > >> Alternatively, you may find a way to do this completely with a web
>> > >> application under a Java app server such as Tomcat.  I have not yet
>> > >> done the research to find out whether this is a feasible alternative.
>> > >> Effectively, what you need something like mod_auth_kerb to do is to
>> > >> authenticate your user against Active Directory, or whomever the
>> > authenticator ought to be.
>> > >>  JAAS may be helpful here.
>> > >>
>> > >> There are, of course, intentions to fill out the missing pieces
>> > >> more completely and transparently via a Solr search plugin and/or
>> filter.
>> > >> What has been lacking is time.  If you are in a position to do
>> > >> development in this area, we're happy to have any assistance you
>> > >> might
>> > provide.
>> > >>
>> > >> Thanks,
>> > >> Karl
>> > >> ________________________________
>> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> > >> Sent: Tuesday, April 20, 2010 5:06 AM
>> > >> To: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>
>> > >>  Subject: Solr and LCF security at query time
>> > >>
>> > >> Hi,
>> > >>
>> > >> I don't see in the LCF wiki how Solr and LCF work together at query
>> > >> time in order to remove from the result list the items the user is
>> > >> not allowed to access.
>> > >>
>> > >> In
>> > >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> > >> I just see these sentences :
>> > >>
>> > >> " Once all these documents and their access tokens are handed to
>> > >> the search engine, it is the search engine's job to enforce
>> > >> security by excluding inappropriate documents from the search
>> > >> results. For Lucene, this infrastructure is expected to be built on
>> > >> top of Lucene's generic metadata abilities, but has not been
>> > >> implemented at
>> > this time."
>> > >>
>> > >> I am not sure I understand. Does this mean that for the moment, it
>> > >> is not possible for Solr to apply security by using an Authority
>> > Connector ?
>> > >>
>> > >> Dominique
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> -------------------------------------------------------------------
>> > >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
>> > >> additional commands, e-mail: dev-help@lucene.apache.org
>> > >>
>> > >
>> > >
>> >
>>
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Yes, I don't doubt that using an external mechanism such as AD lockout will
work for those and other environments. I guess it comes down to the
difference between bespoke consultancy-type solutions and general-purpose
product solutions, of which the requirements are often very different. For a
general Access Control solution integrated into Solr, the presence/type of
such external controls can't, and shouldn't, be assumed. If
they are/must be assumed, one of the core reasons for adding the new
functionality is missing.

As a starting point, for a general purpose access control system, at least
the following questions need to be addressed:
   * What happens when access control needs to change?
   * What happens if access control needs to change often (e.g. more than
several times a day)?
   * Can the access control cope with multiple data source types, without
the need for custom code, including data with no attached acl information?
   * If I change my access control, how is 'offline' data affected? (e.g.
backed-up data)
   * Will the access control satisfy regulatory compliance specs on its own,
or is an external mechanism required?
      (currently, Solr requires an external mechanism, but so does the
proposed solution)

As you might have guessed, I've been down this road before, and the
productization of security control has many facets, and these, as a general
rule, need to be addressed differently in products than in site-specific
deployments - mainly because products can't assume the environment(s) they
will run in (e.g. Active Directory).

The good thing is, there is a good alternative - that is: to store access
control information separately from indexed data and separately from an
authority. To me, that's where the beauty of an LCF plugin architecture
lives. Then, the task is to provide the integration tools (and it sounds
like LCF is very well suited for this) to deliver the 'bridge' between
content and authorization. (as you quite rightly said, authentication is a
separate, albeit related, subject)

Thanks,
Peter




On Wed, Apr 28, 2010 at 12:46 PM, <ka...@nokia.com> wrote:

>  >>>>>>
> With regard to schema extension, I believe we need to be very careful here,
> as requiring index-time storage of access control data will pose a problem
> for any use cases where the access control needs to change (maybe often,
> maybe only occasionally). I'm trying to think of a use case where this
> wouldn't at least potentially be the case, and I can't think of one, but
> perhaps I'm not truly understanding what exactly is stored in the
> __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl
> changes would fit in (e.g. let's say someone has left my organization, do I
> have to update documents to remove his/her access?).
> <<<<<<
>
> Usually the way this works is that the user's account is locked out so they
> can't log in.  The authority service picks up this change, and it therefore
> takes place immediately.
>
> Bear in mind that this particular model has been employed by MetaCarta for
> more than five years in the field with clients such as pretty near all the
> major oil companies, many U.S. government agencies, the U.S. military, etc.
> In that time we have not heard even one complaint about the security model.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Wednesday, April 28, 2010 7:18 AM
>
> *To:* dev@lucene.apache.org
> *Cc:* connectors-dev@incubator.apache.org;
> connectors-user@incubator.apache.org; lucene-dev@apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Apologies for the delayed reply. I've been away on business, and in the
> middle of a product release, so it's been a busy time...
>
> In response to your earlier questions:
>
> The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses,
> although the point at which these are done is slightly different.
>
> I think I am correct in my understanding that with filter queries, the
> results are filtered 'post-Lucene', but are separately (Solr) cached, so you
> get a hit on the first search, but then benefit from cached hits on
> subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are
> applied at the Lucene query directly, so don't have separate Solr caching.
> I've not benchmarked the two, so one or other might be slower/faster for
> various search scenarios.
>
> In any case, I believe either technique can be employed in either 1834 or
> 1872.
>
>
> With regard to schema extension, I believe we need to be very careful here,
> as requiring index-time storage of access control data will pose a problem
> for any use cases where the access control needs to change (maybe often,
> maybe only occasionally). I'm trying to think of a use case where this
> wouldn't at least potentially be the case, and I can't think of one, but
> perhaps I'm not truly understanding what exactly is stored in the
> __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl
> changes would fit in (e.g. let's say someone has left my organization, do I
> have to update documents to remove his/her access?).
>
> Also, would such indexed tokens be entirely 'document-context-free'? I.e.
> Would the same type/format of tokens be used for data from different sources
> (e.g. NTFS files, network streams, NFS, web pages, etc.). Will the tokens be
> compatible with multiple and/or changing authorities (e.g. AD, documentum,
> LDAP, custom, etc.)?
>
> I like the idea of an LCF plugin to hold the acl data. I admit, I've not
> had enough time to look into how this might look at the moment, but it
> sounds like it could be a good way to hold generic (authority-agnostic) acl
> data, and [hopefully] not have to tie it to document data at index-time.
>
> I hope this makes sense, but if I've misunderstood the proposed mechanism,
> please correct me. Would the __ALLOW_TOKEN__ et al fields store, for
> example, SID information?
>
>
> Thanks,
> Peter
>
>
>
> On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:
>
>> Ok, not hearing back from Peter, I've done some Solr research and written
>> some code that might work.  The approach I've taken is most similar to SOLR
>> 1834, other than the LCF-centric logic.  Hopefully there will be a chance to
>> try this out in a full end-to-end way  on the weekend, after which I will
>> submit it to the Solr team (where I think it most naturally would be built
>> and delivered).
>>
>> What it's going to need is either a static or dynamic schema addition to
>> define __ALLOW_TOKEN__document, __DENY_TOKEN__document,
>> __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be
>> string, multivalued fields (I think).  It would be great if these could be
>> made a default part of Solr; similarly, it would be good if the new search
>> component was predelivered with Solr and mentioned (even if commented out)
>> in the example solrconfig.xml file.  The only other thing that needs to be
>> done to hook up the search component is to include a configuration parameter
>> describing the base URL of the LCF authority service.  Plus, as I said
>> earlier, we still don't have a canned solution for authentication yet -
>> although I feel that will be straightforward.
>>
>> Comments welcome...
>> Karl
>>
>>
>> ________________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 27, 2010 8:20 AM
>> To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
>> Subject: RE: FW: Solr and LCF security at query time
>>
>> Hi Peter,
>>
>> I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions
>> in detail, and have a couple of SOLR-related questions.
>>
>> Both contributions rely on a SearchComponent to work their magic.
>>  However, it also appears that each modifies the user query in a different
>> way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses
>> standard AND and OR filterquery clauses.  Both of them are constructed using
>> Solr FilterQuery objects.  Here are my questions:
>>
>> (1) I am not conversant enough with Solr yet to know the difference
>> between the different kinds of clause structure.  Do you know if there is a
>> difference?  For example, is there any possibility that AND/OR clauses can
>> permit documents to be seen that should not be seen?  (MUST and MUST_NOT
>> sound a lot more definite...)
>>
>> (2) Are Solr FilterQuery objects applied to constructing the query that
>> will be sent to Lucene?  Or are they applied by Solr after-the-fact to the
>> resultset?  Or, is it a combination of the two, depending on the details of
>> your actual filter clause?
>>
>> I also haven't heard much from you in the last week or so - have you
>> thought further about what you intend to do, and can you let me know whether
>> you are still interested in developing an LCF plugin for Solr?
>>
>> Thanks,
>> Karl
>>
>> -----Original Message-----
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> Sent: Thursday, April 22, 2010 12:23 PM
>> To: dev@lucene.apache.org
>> Cc: connectors-dev@incubator.apache.org;
>> connectors-user@incubator.apache.org; lucene-dev@apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> See inline...
>>
>> On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:
>>
>> > Hi Peter,
>> >
>> > The authority connectors don't perform authentication at this time.
>> > In fact, LCF has nothing to do with authentication at all - just
>> > authorization.
>> >  The reason for this is because it is almost never the case that
>> > somebody wants to provide multiple credentials in order to be able to see
>> > their results.
>> >  Most enterprises who have multiple repositories authenticate against
>> > AD and then map AD user names to repository user names in order to
>> > access those repositories.  If you noted my earlier posts from this
>> > morning, you may have noted that I'm looking at recommending JAAS plus
>> > sun's kerb5 login module for handling the "authenticate against AD"
>> > case, which would cover some 95%+ of the real world authentication
>> > needed out there.
>> >
>> >
>> I did read your earlier post regarding this, and I totally agree with you
>> - this is best handled 'upstream'. In fact, I use a JAAS plugin in other
>> places in the product (not Solr) for authentication.
>>
>>
>> >
>> > Yes, the idea is to store SIDs in solr at index time.  I don't know
>> > enough about solr to know what kinds of issues this might entail, but
>> > Lucene certainly has a model of metadata that's pretty flexible, so I
>> > don't think this would be difficult at all.  Eric Hatcher also seemed
>> > to confirm my suspicions that this would not be a problem.
>> >
>>
>> It's certainly not a problem to store this data in Solr. The problem is
>> more that you don't really *want* to store this data at index time.
>> There are lots of reasons for not wanting to 'hard-code' SID data with
>> documents in the index. Here's just a few:
>>  * What happens if/when you want to add explicit user access to some
>> [group of] documents ? (i.e. not via a group)
>>  * What happens if you need to revoke or change a user's or group's
>> access?
>>  * It's difficult to move/replicate the index to another domain
>>  * For AD, SIDs are generally not meant to be stored long term outside of
>> AD, as they can be changed (this doesn't happen often, but it can happen
>> after an AD rebuild, domain type upgrade, data recovery etc.)
>>
>> These and other scenarios mean re-indexing the stored data. When the index
>> is huge, this is non-trivial (time-wise). There are not uncommon scenarios
>> where user/group access control can change multiple times in one day.
>>
>> There might be a way of storing acl data in a payload or similar, but I'm
>> not sure how that would work across millions of [arbitrarily grouped ]
>> documents (I'm not familiar enough with payloads to know if this would be a
>> good or bad idea).
>>
>>
>> >
>> > This is exactly why I think that we need to do the authentication
>> > upstream of the authority world.
>> >
>> >
>> Agreed.
>>
>>
>>
>> >
>> > If Solr handles arbitrary document metadata, then I think we could
>> > just use that feature.  But you know more about it than me, at this
>> > point.  It would be great to get an overview of potential ways of doing
>> > this.
>> >
>> >
>> Payloads, maybe?
>>
>>
>> >
>> > For your particular task, it sounds like you are trying to read from
>> > NTFS and apply security after-the-fact with some acl specification
>> > file.  In that case, I'd write a repository connector that was based
>> > on the file system connector (already part of the stable of connectors
>> > for LCF) which reads ACL information from your acl.xml file.  Or, if
>> > you prefer a UI for specifying ACL information, you could extend the
>> > connector so that security is configured in the UI without having an
>> > external acl.xml file at all - which would be a nice addition to the
>> > existing file system connector.  (Repository connections and jobs are
>> > configured internally in LCF by XML documents stored in the database,
>> > so they can be arbitrarily structured.  I'm happy to help you figure
>> > out how to do this if this is what you decide to do.)
>> >
>> For my particular requirements, there are no files - the data is generated
>> from the network and stored. After the fact, there is no persistent
>> location of this data other than in Solr.
>>
>> Storing the acl info using the connector sounds very interesting. Could be
>> worth looking at in more details. Thanks!
>>
>>
>>
>> > I think we still need to add in the authentication piece to make this
>> > all work for you, so perhaps you can describe how you expect a user to
>> > interact with your system, so I can understand your design issues.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>>
>>
>>
>>
>>
>>
>>
>>
>> > -----Original Message-----
>> > From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> > Sent: Thursday, April 22, 2010 11:32 AM
>> > To: dev@lucene.apache.org
>> > Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> > connectors-dev@incubator.apache.org
>> > Subject: Re: FW: Solr and LCF security at query time
>> >
>> > Hi Karl,
>> >
>> > Thanks very much for your detailed explanation - really good!
>> >
>> > As I've thought through some of the implications, I've added comments
>> > below, so I hope they don't seem too jumbled...
>> >
>> > I suppose on the 'authority' side, it works kind of as I envisioned it
>> > would.
>> >
>> > For general Solr access control, there's two layers of security that
>> > need to be addressed:
>> >  1. Authentication - make sure the incoming query is from a valid
>> > user, and the passed-in credentials (hash, certificate etc.) are
>> > correct
>> >  2. Query filtering - potentially reduce the number/type of
>> > returned results based on the allow/deny metadata for the
>> > authenticated user
>> >
>> > I can see how the LCF auth connector works for 2., but can it do 1. as
>> > well?
>> > It would be good if this could somehow be integrated into any
>> > container (Tomcat/Jetty et al) authentication that might be configured
>> > (probably related to your previous post). In many ways, it could/should
>> > be that the Authority (AD) part of the connector should only be
>> > concerned with 1. and not 2. (see below).
>> >
>> > So, on the repository side, there is also an LCF connector that
>> > 'closes the loop' to provide the 'what is it I'm trying to control' side
>> of things.
>> > I understand that LCF doesn't do the mapping - it delegates this task
>> > to the caller, but provides both sides of the equation (authority &
>> > repository).
>> >
>> > >>>>>
>> > - Each file in DirectoryA will have the following
>> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890",
>> and "myAD:S-23-64-12345".
>> > - Each file in DirectoryB will have the following
>> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
>> > <<<<<
>> > I think this is the bit that is worrying me - is this storing the SIDs
>> > into Solr at document index time? This would be a problem for a whole
>> > load of reasons, but maybe I'm missing something here? (see below for
>> > a possible
>> > alternative)
>> >
>> > Basically, what I'm getting at here is that the allow/deny values need
>> > to be stored in one of three places:
>> >  1. In the authority (e.g. inside AD)
>> >  2. In the document metadata (index-time)
>> >  3. In external storage (e.g. acl.xml, NTFS etc.)
>> >
>> > 1. Extending AD is pretty much out, as this causes too many interop
>> > problems
>> > 2. 'Hard-coding' acl information in the index makes it
>> > non-portable, resistant to changes, etc.
>> > 3. acl.xml is coupled with a Solr instance, but is easily
>> > ported/replicated.
>> > Storing/retrieving acl information from the source (e.g. NTFS) is
>> > problematic, as the source may not be accessible (it may not even
>> > exist).
>> >
>> > I believe 3. or a variant is the way to go on the repo side, which
>> > means the LCF Authority connector is mainly for Authentication (see
>> > above), which is what you want from AD et al integration.
>> > The problem that arises from 'pluggable' authentication is that, if
>> > you're not using a certificate, you have to start with a password, but
>> > the connector only has access to the password hash (unless the pwd is
>> > sent in the query url). I don't know of a way to confirm identities in
>> > AD using only the username and hash (AD does the hash compare). I
>> > believe this is where container-based integration will likely work
>> better.
>> >
>> > So that I can confirm my understanding...a scenario might be like this:
>> >
>> > We have an AD connector that fetches the SIDs and we can read them etc.
>> > For my environment, where there are no 'files' (there's only a
>> > transient network stream), we have an LCF 'Solr Field Filter Query'
>> > connector that decides which Filter Queries to apply (allow and deny)
>> > for the passed in SID(s).
>> >
>> > For another environment, let's say, NTFS, there might be an 'NTFS'
>> > connector that would provide some kind of mapping of files/folders to
>> > SID(s). Since Solr wouldn't intrinsically know about this, the acl
>> > information would need to be stored somewhere in the index. This would
>> > mean extending the Solr schema and storing metadata at index time.
>> > The alternative is to re-use the 'Solr Field Filter Query' connector
>> > for this as well (and any other document types that might be read in).
>> > This keeps the index 'clean' of acl-specific metadata, and allows for
>> > in-place changes and easy cross-document/index/instance access control.
>> >
>> >
>> > If the above interpretation is [roughly] correct (please let me know
>> > if I've got this wrong!), this would reduce down to having:
>> >   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
>> > (possibly/partly at the container level)
>> >   2. At least an LCF Repository connector for 'acl.xml'
>> >   3. Optional other LCF Repository connectors
>> >
>> > It sounds like you've now finished the first half of 1. by adding the
>> > ability to get the required auth data from a Solr api call. The other
>> > half of 1. will be implementing the LCF interface in the
>> > SolrACLSecurity class, to effectively replace the 'user', 'group' and
>> 'password' bits of acl.xml.
>> >
>> > Does the above sound like an accurate interpretation? Just trying to
>> > get a good picture of what work needs doing, where it goes, etc.
>> >
>> > Many thanks!
>> > Peter
>> >
>> >
>> >
>> >
>> > On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>> >
>> > >  >>>>>>
>> > > What is the relationship between stored data (documents) and
>> authorities'
>> > > access/deny attributes? (do you have any examples of what an
>> > > access_token value might contain?) <<<<<<
>> > >
>> > > Documents have access/deny attributes; authorities simply provide
>> > > the list of tokens that belong to an authenticated user.  Thus,
>> > > there's no access/deny for an authority; that's attached to the
>> > > document (as it is in real-world repositories).
>> > >
>> > > Let's run a quick example, using Active Directory and a Windows file
>> > > system.  Suppose that you have a directory with documents in it,
>> > > call it DirectoryA, and the directory allows read access to the
>> > > following
>> > SIDs:
>> > >
>> > > S-123-456-76890
>> > > S-23-64-12345
>> > >
>> > > These SIDs correspond to active directory groups, let's call them
>> > > Group1 and Group2, respectively.
>> > >
>> > > DirectoryB also has documents in it, and those documents have just
>> > > the SID S-123-456-76890 attached, because only Group1 can read its
>> contents.
>> > >
>> > > Now, pretend that someone has created an LCF Active Directory
>> > > authority connection (in the LCF UI), which is called "myAD", and
>> > > this connection is set up to talk to the governing AD domain
>> > > controller for this Windows file system.  We now know enough to
>> > > describe the document
>> > indexing process:
>> > >
>> > > - Each file in DirectoryA will have the following
>> > > __ALLOW_TOKEN__document attributes inside Solr:
>> > > "myAD:S-123-456-76890",
>> > and "myAD:S-23-64-12345".
>> > > - Each file in DirectoryB will have the following
>> > > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
>> > >
>> > > Now, suppose that a user (let's call him "Peter") is authenticated
>> > > with the AD domain controller.  Peter belongs to Group2, so his SIDs
>> > > are
>> > (say):
>> > >
>> > > S-1-1-0 (the 'everyone' SID)
>> > > S-323-999-12345 (his own personal user SID)
>> > > S-23-64-12345 (the SID he gets because he belongs to group 2)
>> > >
>> > > We want to look up the documents in the search index that he can see.
>> > > So, we ask the LCF authority service what his tokens are, and we get
>> > back:
>> > >
>> > > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
>> > >
>> > > The documents we should return in his search are the ones matching
>> > > his search criteria, PLUS the intersection of his tokens with the
>> > > document ALLOW tokens, MINUS the intersection of his tokens with the
>> > > document DENY tokens (there aren't any involved in this example).
>> > > So only files that have one of his three tokens as an ALLOW
>> > > attribute would be
>> > returned.
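
The rule in the paragraph above can be sketched as plain set logic. This is only an illustration using the SIDs from the example; the function name is hypothetical, and in a real deployment the check would be expressed as a restriction on the search rather than evaluated per document in application code.

```python
def can_see(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if the user holds at least one ALLOW token for the
    document and none of its DENY tokens."""
    user = set(user_tokens)
    has_allow = bool(user & set(allow_tokens))
    has_deny = bool(user & set(deny_tokens))
    return has_allow and not has_deny


# Tokens the authority service returns for "Peter" in the example.
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]

# ALLOW tokens attached to the files of each directory at indexing time.
directory_a_allow = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
directory_b_allow = ["myAD:S-123-456-76890"]

peter_sees_a = can_see(peter, directory_a_allow)  # shares the Group2 token
peter_sees_b = can_see(peter, directory_b_allow)  # shares no token
```

With these inputs Peter can see the files in DirectoryA, through his Group2 token, but not those in DirectoryB. A matching DENY token always removes a document, even when an ALLOW token also matches.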
>> > >
>> > > Note that what we are attempting to do is enforce AD's security with
>> > > the search results we present.  There is no need to define a whole
>> > > new security mechanism, because AD already has one that people use.
>> > >
>> > > >>>>>>
>> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
>> > > to ensure there are no security or other dependencies of indexed
>> > > data with any external repository - most notably the file system.
>> > > There are many reasons for wanting this, but one of the main ones is
>> > > that Solr-stored data is not always based on file data (or
>> > > accessible
>> > file data).
>> > > In fact, in my particular case, almost none of the indexed data
>> > > comes from files.
>> > > <<<<<<
>> > >
>> > > LCF is all about abstracting from repositories.  It's not
>> > > specifically about a file system, although that is a convenient
>> > > example.  If you are building your own kind of repository with your
>> > > own security setup, that's fine - but in the LCF world you'd need to
>> > > create an authority connector for your repository (which maybe reads
>> > > your acl.xml file), as well as a repository connector (which hands
>> > > documents to LCF and provides it with the access tokens that make
>> > > security work).  Of course, you can do something much lighter that
>> > > doesn't include LCF at all if you are just integrating a custom
>> > > repository of your own, but it sounded like you were interested in the
>> > > broader problem here.
>> > >
>> > > So, LCF doesn't do "acl mapping" at all.  It relies on its various
>> > > connectors to work cooperatively to define access tokens in a way
>> > > that is consistent from authority connector to repository connector
>> > > for a given repository kind.  Anybody can write a connector, so the
>> > > beauty of all this is that you can build a system where data from
>> > > many disparate sources is indexed, and security for each is
>> > > simultaneously
>> > enforced.
>> > >
>> > > Karl
>> > >
>> > >
>> > >  ------------------------------
>> > > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> > > *Sent:* Thursday, April 22, 2010 9:24 AM
>> > >
>> > > *To:* dev@lucene.apache.org
>> > > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> > > connectors-dev@incubator.apache.org
>> > > *Subject:* Re: FW: Solr and LCF security at query time
>> > >
>> > > Hi Karl,
>> > >
>> > > Thanks very much for the diagram -
>> > > Sorry about all the questions, but this raises a few new ones...
>> > >
>> > > What is the relationship between stored data (documents) and
>> authorities'
>> > > access/deny attributes? (do you have any examples of what an
>> > > access_token value might contain?)
>> > >
>> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
>> > > to ensure there are no security or other dependencies of indexed
>> > > data with any external repository - most notably the file system.
>> > > There are many reasons for wanting this, but one of the main ones is
>> > > that Solr-stored data is not always based on file data (or
>> > > accessible
>> > file data).
>> > > In fact, in my particular case, almost none of the indexed data
>> > > comes from files.
>> > >
>> > > This is one reason why SOLR-1872 uses filter queries for its
>> > > access/deny tokens - so that all the required information for access
>> > > control completely resides within the Solr index itself.
>> > > Is the LCF architecture acl 'mapping' between Solr fields (queries)
>> > > and users, some external 'repository' (files) and users, or
>> > > arbitrary
>> > data (e.g.
>> > > either of these)?
>> > >
>> > > I hope that makes sense...
>> > >
>> > > Thanks!
>> > > Peter
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
>> > >
>> > >> Hi Peter,
>> > >>
>> > >> I've attached a diagram that is not in the wiki as of yet, and I'll
>> > >> try to answer your questions.
>> > >>
>> > >> >>>>>>
>> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
>> > >> stored for a particular user in the underlying acl store (e.g.
>> > >> Active
>> > Directory)?
>> > >> How does AD and/or LCF handle storing such data in its schema?
>> > >> (does AD need its schema extended?) Presumably, any such AD fields
>> > >> would need to be queried for effective rights in order to cater for
>> > >> group membership allows and denies.
>> > >> <<<<<<
>> > >>
>> > >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
>> > >> strings that represent a contract between an LCF authority
>> > >> connection and the LCF repository connection that picks up the
>> > >> documents (from
>> > wherever).
>> > >>  These tokens thus have no real meaning outside of LCF.  You must
>> > >> regard them as opaque.
>> > >>
>> > >> The contract, however, states that if you use the LCF authority
>> > >> service to obtain tokens for an authenticated user, you will get
>> > >> back a set that is CONSISTENT with the tokens that were attached to
>> > >> the documents LCF sent to Solr for indexing in the first place.
>> > >> So, you don't have to worry about it, and that's kind of the idea.
>> > >> So you
>> > imagine the following flow:
>> > >>
>> > >> (1) Use LCF to fetch documents and send them to Solr
>> > >> (2) When searching, use the LCF authority service to get the
>> > >> desired user's access tokens
>> > >> (3) Either filter the results, or modify the query, to be sure the
>> > >> access tokens all match up properly
>> > >>
>> > >> For the AD authority, the LCF access tokens consist, in part, of
>> > >> the user's SIDs.  For other authorities, the access tokens are
>> > >> wildly
>> > different.
>> > >>  You really don't want to know what's in them, since that's the job
>> > >> of the LCF authority to determine. ;-)
>> > >>
>> > >> LCF is not, by the way, joined at the hip with AD.  However, in
>> > >> practice, most enterprises in the world use some form of AD single
>> > >> signon for their web applications, and even if they're using some
>> > >> repository with its own idea of security, there's a mapping between
>> > >> the AD users and the repository's users.  Doing that mapping is
>> > >> also the job of the LCF authority for that repository.
>> > >>
>> > >> Hope this helps.  Also, I'm not expecting time miracles here, so
>> > >> don't sweat the schedule.
>> > >>
>> > >>
>> > >> Karl
>> > >>
>> > >>
>> > >> ________________________________________
>> > >> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> > >> Sent: Thursday, April 22, 2010 4:27 AM
>> > >> To: dev@lucene.apache.org
>> > >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> > >> connectors-dev@incubator.apache.org
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> Hi Karl,
>> > >>
>> > >> Thanks for the quick turnaround.
>> > >> I'm in the middle of a product release for us, so I fear I won't be
>> > >> as quick as you... :-)
>> > >>
>> > >> I couldn't find a simple flow diagram or similar for LCF with
>> > >> regards security (probably looking in the wrong place).
>> > >> Perhaps you could help on these questions...?
>> > >>
>> > >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
>> > >> sub-queries, which are then used as filter queries in a user's
>> search.
>> > >>
>> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
>> > >> stored for a particular user in the underlying acl store (e.g.
>> > >> Active
>> > Directory)?
>> > >> How does AD and/or LCF handle storing such data in its schema?
>> > >> (does AD need its schema extended?) Presumably, any such AD fields
>> > >> would need to be queried for effective rights in order to cater for
>> > >> group membership allows and denies.
>> > >>
>> > >> I guess I'm just trying to understand the architectural
>> > >> flow/storage/retrieval of data in the various parts of the system,
>> > >> but I admit, I need to do more research on this.
>> > >> After our product release, when I get a few more spare cycles, I
>> > >> can look at it in more detail.
>> > >>
>> > >> Many thanks!
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> Hi Peter,
>> > >>
>> > >> I just committed the promised changes to the LCF Solr output
>> connector.
>> > >>
>> > >> ACL metadata will now be posted to the Solr Http interface along
>> > >> with the document as the two following fields:
>> > >>
>> > >> __ACCESS_TOKEN__document
>> > >> __DENY_TOKEN__document
>> > >>
>> > >> There will, of course, potentially be multiple values for each of
>> > >> these two fields.
>> > >>
>> > >> Hope this helps,
>> > >> Karl
>> > >>
>> > >> ________________________________
>> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> > >> peter.sturge@googlemail.com>]
>> > >> Sent: Tuesday, April 20, 2010 6:51 PM
>> > >>
>> > >> To: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> Hi Karl,
>> > >>
>> > >> Thanks for the info. I'll have a look at the link and try to take
>> > >> in as much sugar as my insulin levels will handle...
>> > >> It sounds like the necessary interface(s) are already in LCF - just
>> > >> a matter of implementing them in the Solr 1872 plugin.
>> > >> I'll need to digest the LCF stuff to get to grips with it..please
>> > >> bear with me while I do that...
>> > >>
>> > >> When you say:
>> > >>   The LCF solr output connection doesn't yet do this, but it is
>> > >> trivial for me to make that happen.
>> > >> Do you mean a mechanism by which solr.war can get url et al info
>> > >> from its parent container (Tomcat, Jetty etc.), or have I
>> > >> misinterpreted
>> > this?
>> > >>
>> > >>
>> > >> Thanks,
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> Hi Peter,
>> > >>
>> > >> I'm the principal committer for LCF, but I don't know as much about
>> > >> Solr as I ought to, so it sounds like a potentially productive
>> > collaboration.
>> > >>
>> > >> LCF does exactly what you are looking for - the only issue at all
>> > >> is that you need to fetch a URL from a webapp to get what you are
>> > >> looking for.  The "plugs" are all inside LCF for different kinds of
>> > >> repositories.  Here's a link that might help with drinking the LCF
>> > >> "koolaid", as it were:
>> > >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>> > >>
>> > >> The url would be something like this (on a locally installed
>> > >> tomcat-based LCF instance):
>> > >>
>> > >>
>> > >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>> > >>
>> > >> ... and this fetch returns something like:
>> > >>
>> > >> TOKEN:xxxxxxx
>> > >> TOKEN:yyyyyyy
>> > >> TOKEN:zzzzzzz
>> > >> ....
>> > >>
>> > >> ... which represent the amalgamated tokens for all of the defined
>> > >> authorities, and by some strange coincidence ( ;-) ) are compatible
>> > >> with certain pieces of metadata that have been passed into Solr
>> > >> with each document - one set of Allow tokens, and a second set of
>> > >> Deny tokens.  The LCF solr output connection doesn't yet do this,
>> > >> but it is trivial for me to make that happen.
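
A minimal sketch of the client side of this fetch, assuming the URL shape and the "TOKEN:" line format shown above. The function names are hypothetical, and ignoring any line that does not start with "TOKEN:" is an assumption, since the example response is elided.

```python
from urllib.parse import urlencode


def authority_url(base_url, username):
    """Build the UserACLs request URL for one authenticated user."""
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract token values from a response made of 'TOKEN:<value>' lines.

    Lines without the 'TOKEN:' prefix are skipped (assumption: the thread
    does not show what else the service may return).
    """
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens
```

A caller would fetch authority_url("http://localhost:8080/lcf-authority-service", "someusername@somedomain.com") with any HTTP client and pass the response body to parse_tokens. The tokens are then used as opaque strings when restricting the search.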
>> > >>
>> > >> Does this sound plausible to you?
>> > >>
>> > >> Karl
>> > >>
>> > >>
>> > >> ________________________________
>> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> > >> peter.sturge@googlemail.com>]
>> > >> Sent: Tuesday, April 20, 2010 5:41 PM
>> > >> To: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
>> > >> dev@lucene.apache.org>
>> > >>
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> Hi Karl,
>> > >>
>> > >> Integrating LCF to get external token support for SOLR-1872 sounds
>> > >> very interesting indeed. I don't know anything about LCF, but one
>> > >> of the things I was planning for SOLR-1872 is to make acl.xml (or
>> > >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
>> > >> series of plugins that could be used for obtaining back-end
>> > >> authentication
>> > information.
>> > >>
>> > >> If you're good with LCF, perhaps we could work together to build
>> > >> this
>> > in.
>> > >> One of the first things would be defining an interface that would
>> > >> be as easy as possible to plug LCF into. Have you any
>> > >> suggestions/insight on this front?
>> > >>
>> > >> Many thanks,
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> SOLR-1872 looks exactly like what I was envisioning, from the
>> > >> search query perspective, although instead of the acl xml file you
>> > >> specify LCF stipulates you would dynamically query the
>> > >> lcf-authority-service servlet for the access tokens themselves.
>> > >> That would get you support for AD, Documentum, LiveLink, Meridio,
>> > >> and Memex for free. It seems likely that this component could be
>> > >> modified to work with LCF with minor
>> > effort.
>> > >>
>> > >> The missing component still seems to be AD authentication, which
>> > >> needs a solution.
>> > >>
>> > >> Karl
>> > >>
>> > >> ________________________________
>> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> > >> peter.sturge@googlemail.com>]
>> > >> Sent: Tuesday, April 20, 2010 10:44 AM
>> > >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> If you want to do this completely within Solr, have a look at:
>> > >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>> > >>
>> > >> Thanks,
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> FYI
>> > >>
>> > >> ________________________________
>> > >> From: Wright Karl (Nokia-S/Cambridge)
>> > >> Sent: Tuesday, April 20, 2010 8:16 AM
>> > >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
>> > >> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
>> > >> connectors-dev@incubator.apache.org<mailto:
>> > >> connectors-dev@incubator.apache.org>'; '
>> > >> connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>'
>> > >> Subject: RE: Solr and LCF security at query time
>> > >>
>> > >> Dominique,
>> > >>
>> > >> Yes, I am aware of this ticket and contribution.  Luckily LCF
>> > >> establishes a powerful multi-repository security model, even though
>> > >> it doesn't yet do the final step of enforcing that model at the
>> > >> search end.  LCF allows you to define multiple authorities to
>> > >> operate against disparate repositories, and use the appropriate
>> > >> authority to secure any given document.  The solr people are aware
>> > >> of this design, which addresses the issues raised by SOLR-1834 very
>> > >> nicely.  However, as I said before, time is a problem, and the work
>> > >> still needs to be done.
>> > >>
>> > >> I suggest you read up on the actual security model of LCF, and
>> > >> perhaps experiment with that and the SOLR-1834 contribution, to see
>> > >> if there is common ground.  One thing we've learned at MetaCarta is
>> > >> that post-filtering for security purposes is expensive, and it is
>> > >> better to modify the queries themselves to restrict the results, if
>> > >> possible.  I'm not sure which approach SOLR-1834 takes, although it
>> > >> sounds like it might be the filtering approach.  Still, it would be
>> > >> better than nothing.
>> > >>
>> > >> Please let me know what you find out.
>> > >>
>> > >> Thanks,
>> > >> Karl
>> > >>
>> > >> ________________________________
>> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
>> > >> dominique.bejean@eolya.fr>]
>> > >> Sent: Tuesday, April 20, 2010 8:03 AM
>> > >> To: Wright Karl (Nokia-S/Cambridge)
>> > >> Cc: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>;
>> > >> connectors-dev@incubator.apache.org<mailto:
>> > >> connectors-dev@incubator.apache.org>
>> > >> Subject: Re: Solr and LCF security at query time
>> > >>
>> > >> Karl,
>> > >>
>> > >> Thank you for your reply.
>> > >>
>> > >> I did some research today and found this:
>> > >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> > >> http://demo.findwise.se:8880/SolrSecurity/
>> > >>
>> > >> The Solr security model has to be able to filter a result list with
>> > >> items coming from various sources at the same time (livelink,
>> > >> documentum, file system, ...). Big subject :)
>> > >>
>> > >> Dominique
>> > >>
>> > >>
>> > >> On 20/04/10 13:34,
>> > >> karl.wright@nokia.com<ma...@nokia.com> wrote:
>> > >> Hi Dominique,
>> > >>
>> > >> At the moment, in order to enforce the LCF security model within
>> > >> Lucene/Solr, you will need to build this functionality into
>> > >> whatever client you are using to display the Lucene search results.
>> > >> Specifically, you would need to take the following steps:
>> > >>
>> > >> (1) Have your users access your search client through Apache.
>> > >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> > >> mod_authz_annotate, to cause authorization HTTP headers to be
>> > >> transmitted to the client webapp.
>> > >> (3) Have your client webapp alter whatever queries it is doing, to
>> > >> add an appropriate query clause for each of the access tokens
>> > >> transmitted in the headers.
>> > >>
>> > >> (This is how it is done at MetaCarta.)
>> > >>
>> > >> Alternatively, you may find a way to do this completely with a web
>> > >> application under a Java app server such as Tomcat.  I have not yet
>> > >> done the research to find out whether this is a feasible alternative.
>> > >> Effectively, what you need something like mod_auth_kerb to do is to
>> > >> authenticate your user against Active Directory, or whatever the
>> > >> authenticator ought to be.  JAAS may be helpful here.
>> > >>
>> > >> There are, of course, intentions to fill out the missing pieces
>> > >> more completely and transparently via a Solr search plugin and/or
>> > >> filter.  What has been lacking is time.  If you are in a position to
>> > >> do development in this area, we're happy to have any assistance you
>> > >> might provide.
>> > >>
>> > >> Thanks,
>> > >> Karl
>> > >> ________________________________
>> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> > >> Sent: Tuesday, April 20, 2010 5:06 AM
>> > >> To: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>
>> > >>  Subject: Solr and LCF security at query time
>> > >>
>> > >> Hi,
>> > >>
>> > >> I don't see in the LCF wiki how Solr and LCF work together at query
>> > >> time in order to remove from the result list the items the user is
>> > >> not allowed to access.
>> > >>
>> > >> In
>> > >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> > >> I just see these sentences:
>> > >>
>> > >> " Once all these documents and their access tokens are handed to
>> > >> the search engine, it is the search engine's job to enforce
>> > >> security by excluding inappropriate documents from the search
>> > >> results. For Lucene, this infrastructure is expected to be built on
>> > >> top of Lucene's generic metadata abilities, but has not been
>> > >> implemented at this time."
>> > >>
>> > >> I am not sure I understand. Does this mean that, for the moment, it
>> > >> is not possible for Solr to apply security by using an Authority
>> > >> Connector?
>> > >>
>> > >> Dominique
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> -------------------------------------------------------------------
>> > >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
>> > >> additional commands, e-mail: dev-help@lucene.apache.org
>> > >>
>> > >
>> > >
>> >
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Yes, I don't doubt that using an external mechanism such as AD lockout will
work for those and other environments. I guess it comes down to the
difference between bespoke consultancy-type solutions and general-purpose
product solutions, whose requirements are often very different. For a
general Access Control solution integrated into Solr, the presence or type
of such external controls can't, and shouldn't, be assumed. If they
are/must be assumed, one of the core reasons for adding the new
functionality is missing.

As a starting point, for a general purpose access control system, at least
the following questions need to be addressed:
   * What happens when access control needs to change?
   * What happens if access control needs to change often (e.g. more than
     several times a day)?
   * Can the access control cope with multiple data source types, without
     the need for custom code, including data with no attached acl
     information?
   * If I change my access control, how is 'offline' data affected? (e.g.
     backed-up data)
   * Will the access control satisfy regulatory compliance specs on its own,
     or is an external mechanism required?
     (currently, Solr requires an external mechanism, but so does the
     proposed solution)

As you might have guessed, I've been down this road before. The
productization of security control has many facets, and these, as a general
rule, need to be addressed differently in products than in site-specific
deployments - mainly because products can't assume the environment(s) they
will run in (e.g. Active Directory).

The good thing is, there is a good alternative - that is: to store access
control information separately from indexed data and separately from an
authority. To me, that's where the beauty of an LCF plugin architecture
lives. Then, the task is to provide the integration tools (and it sounds
like LCF is very well suited for this) to deliver the 'bridge' between
content and authorization. (as you quite rightly said, authentication is a
separate, albeit related, subject)
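
To make the idea concrete, here is a minimal sketch (Python, purely
illustrative) of access control data held outside both the index and the
authority. The class name, the principal names and the filter query strings
are all assumptions made up for this example; none of them come from
SOLR-1872 or LCF. The point it tries to show is that a grant or revocation
only touches this store, so nothing has to be re-indexed.

```python
class AclStore:
    """In-memory stand-in for access control data kept outside the index."""

    def __init__(self):
        # principal (user or group name) -> set of filter-query strings
        self._filters = {}

    def grant(self, principal, filter_query):
        """Allow a principal to see documents matching filter_query."""
        self._filters.setdefault(principal, set()).add(filter_query)

    def revoke(self, principal, filter_query=None):
        """Remove one grant, or every grant when filter_query is None."""
        if filter_query is None:
            self._filters.pop(principal, None)
        else:
            self._filters.get(principal, set()).discard(filter_query)

    def filters_for(self, principals):
        """Return the sorted union of filter queries for the principals."""
        result = set()
        for principal in principals:
            result |= self._filters.get(principal, set())
        return sorted(result)


store = AclStore()
store.grant("alice", "source:firewall")      # hypothetical field and value
store.grant("group:ops", "source:proxy")     # hypothetical field and value

before = store.filters_for(["alice", "group:ops"])
store.revoke("alice")                        # e.g. alice leaves the company
after = store.filters_for(["alice"])
```

A search component would look up the filters for the authenticated user and
the user's groups on each request, so a change made here applies to the very
next query.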

Thanks,
Peter




On Wed, Apr 28, 2010 at 12:46 PM, <ka...@nokia.com> wrote:

>  >>>>>>
> With regard to schema extension, I believe we need to be very careful here,
> as requiring index-time storage of access control data will pose a problem
> for any use cases where the access control needs to change (maybe often,
> maybe only occasionally). I'm trying to think of a use case where this
> wouldn't at least potentially be the case, and I can't think of one, but
> perhaps I'm not truly understanding what exactly is stored in the
> __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl
> changes would fit in (e.g. let's say someone has left my organization, do I
> have to update documents to remove his/her access?).
> <<<<<<
>
> Usually the way this works is that the user's account is locked out so they
> can't log in.  The authority service picks up this change, and it therefore
> takes place immediately.
>
> Bear in mind that this particular model has been employed by MetaCarta for
> more than five years in the field with clients such as pretty near all the
> major oil companies, many U.S. government agencies, the U.S. military, etc.
> In that time we have not heard even one complaint about the security model.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Wednesday, April 28, 2010 7:18 AM
>
> *To:* dev@lucene.apache.org
> *Cc:* connectors-dev@incubator.apache.org;
> connectors-user@incubator.apache.org; lucene-dev@apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Apologies for the delayed reply. I've been away on business, and in the
> middle of a product release, so it's been a busy time...
>
> In response to your earlier questions:
>
> The 'AND/OR' filter query will ultimately map down to Lucene Boolean
> clauses, although the point at which this is done is slightly different.
>
> I think I am correct in my understanding that with filter queries, the
> results are filtered 'post-Lucene', but are separately (Solr) cached, so you
> get a hit on the first search, but then benefit from cached hits on
> subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are
> applied at the Lucene query directly, so don't have separate Solr caching.
> I've not benchmarked the two, so one or other might be slower/faster for
> various search scenarios.
>
> In any case, I believe either technique can be employed in either 1834 or
> 1872.
>
>
> With regard to schema extension, I believe we need to be very careful here,
> as requiring index-time storage of access control data will pose a problem
> for any use cases where the access control needs to change (maybe often,
> maybe only occasionally). I'm trying to think of a use case where this
> wouldn't at least potentially be the case, and I can't think of one, but
> perhaps I'm not truly understanding what exactly is stored in the
> __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl
> changes would fit in (e.g. let's say someone has left my organization, do I
> have to update documents to remove his/her access?).
>
> Also, would such indexed tokens be entirely 'document-context-free'? I.e.
> Would the same type/format of tokens be used for data from different sources
> (e.g. NTFS files, network streams, NFS, web pages, etc.). Will the tokens be
> compatible with multiple and/or changing authorities (e.g. AD, documentum,
> LDAP, custom, etc.)?
>
> I like the idea of an LCF plugin to hold the acl data. I admit, I've not
> had enough time to look into how this might look at the moment, but it
> sounds like it could be a good way to hold generic (authority-agnostic) acl
> data, and [hopefully] not have to tie it to document data at index-time.
>
> I hope this makes sense, but if I've misunderstood the proposed mechanism,
> please correct me. Would the __ALLOW_TOKEN__ et al fields store, for
> example, SID information?
>
>
> Thanks,
> Peter
>
>
>
> On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:
>
>> Ok, not hearing back from Peter, I've done some Solr research and written
>> some code that might work.  The approach I've taken is most similar to SOLR
>> 1834, other than the LCF-centric logic.  Hopefully there will be a chance to
>> try this out in a full end-to-end way  on the weekend, after which I will
>> submit it to the Solr team (where I think it most naturally would be built
>> and delivered).
>>
>> What it's going to need is either a static or dynamic schema addition to
>> define __ALLOW_TOKEN__document, __DENY_TOKEN__document,
>> __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be
>> string, multivalued fields (I think).  It would be great if these could be
>> made a default part of Solr; similarly, it would be good if the new search
>> component was predelivered with Solr and mentioned (even if commented out)
>> in the example solrconfig.xml file.  The only other thing that needs to be
>> done to hook up the search component is to include a configuration parameter
>> describing the base URL of the LCF authority service.  Plus, as I said
>> earlier, we still don't have a canned solution for authentication yet -
>> although I feel that will be straightforward.
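
As a rough illustration of how such a search component might turn a user's
access tokens into a restriction over these fields, here is a minimal Python
sketch. It is not the code referred to in this message. It assumes the
tokens have already been fetched from the LCF authority service, it looks
only at the two document-level fields (the share-level fields are ignored
for brevity), and the way the clauses are combined (one required allow
clause, one prohibited deny clause) is an assumption.

```python
def quote_term(token):
    """Quote a token as a Lucene phrase, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_security_filter(user_tokens,
                          allow_field="__ALLOW_TOKEN__document",
                          deny_field="__DENY_TOKEN__document"):
    """Build a Lucene-syntax clause restricting results to permitted docs.

    A document must carry at least one of the user's tokens in the allow
    field and must carry none of them in the deny field.
    """
    if not user_tokens:
        # With no tokens nothing can be allowed; refuse rather than guess.
        raise ValueError("no access tokens supplied")
    terms = " OR ".join(quote_term(token) for token in user_tokens)
    return "+{0}:({2}) -{1}:({2})".format(allow_field, deny_field, terms)


security_filter = build_security_filter(
    ["myAD:S-1-1-0", "myAD:S-23-64-12345"])
print(security_filter)
```

The resulting string is meant to be added to the request as a filter next to
the user's own query, so that the restriction is part of the search itself
rather than a pass over the results afterwards.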
>>
>> Comments welcome...
>> Karl
>>
>>
>> ________________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 27, 2010 8:20 AM
>> To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
>> Subject: RE: FW: Solr and LCF security at query time
>>
>> Hi Peter,
>>
>> I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions
>> in detail, and have a couple of SOLR-related questions.
>>
>> Both contributions rely on a SearchComponent to work their magic.
>>  However, it also appears that each modifies the user query in a different
>> way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses
>> standard AND and OR filterquery clauses.  Both of them are constructed using
>> Solr FilterQuery objects.  Here are my questions:
>>
>> (1) I am not conversant enough with Solr yet to know the difference
>> between the different kinds of clause structure.  Do you know if there is a
>> difference?  For example, is there any possibility that AND/OR clauses can
>> permit documents to be seen that should not be seen?  (MUST and MUST_NOT
>> sound a lot more definite...)
>>
>> (2) Are Solr FilterQuery objects applied to constructing the query that
>> will be sent to Lucene?  Or are they applied by Solr after-the-fact to the
>> resultset?  Or, is it a combination of the two, depending on the details of
>> your actual filter clause?
>>
>> I also haven't heard much from you in the last week or so - have you
>> thought further about what you intend to do, and can you let me know whether
>> you are still interested in developing an LCF plugin for Solr?
>>
>> Thanks,
>> Karl
>>
>> -----Original Message-----
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> Sent: Thursday, April 22, 2010 12:23 PM
>> To: dev@lucene.apache.org
>> Cc: connectors-dev@incubator.apache.org;
>> connectors-user@incubator.apache.org; lucene-dev@apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> See inline...
>>
>> On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:
>>
>> > Hi Peter,
>> >
>> > The authority connectors don't perform authentication at this time.
>> > In fact, LCF has nothing to do with authentication at all - just
>> > authorization.  The reason for this is that it is almost never the
>> > case that somebody wants to provide multiple credentials in order to
>> > be able to see their results.  Most enterprises who have multiple
>> > repositories authenticate against AD and then map AD user names to
>> > repository user names in order to access those repositories.  If you
>> > read my earlier posts from this morning, you may have noted that I'm
>> > looking at recommending JAAS plus Sun's Krb5 login module for
>> > handling the "authenticate against AD" case, which would cover some
>> > 95%+ of the real world authentication needed out there.
>> >
>> >
>> I did read your earlier post regarding this, and I totally agree with you
>> - this is best handled 'upstream'. In fact, I use a JAAS plugin in other
>> places in the product (not Solr) for authentication.
>>
>>
>> >
>> > Yes, the idea is to store SIDs in solr at index time.  I don't know
>> > enough about solr to know what kinds of issues this might entail, but
>> > Lucene certainly has a model of metadata that's pretty flexible, so I
>> > don't think this would be difficult at all.  Erik Hatcher also seemed
>> > to confirm my suspicions that this would not be a problem.
>> >
>>
>> It's certainly not a problem to store this data in Solr. The problem is
>> more that you don't really *want* to store this data at index time.
>> There are lots of reasons for not wanting to 'hard-code' SID data with
>> documents in the index. Here are just a few:
>>  * What happens if/when you want to add explicit user access to some
>>    [group of] documents? (i.e. not via a group)
>>  * What happens if you need to revoke or change a user's or group's
>>    access?
>>  * It's difficult to move/replicate the index to another domain
>>  * For AD, SIDs are generally not meant to be stored long term outside of
>>    AD, as they can be changed (this doesn't happen often, but it can happen
>>    after an AD rebuild, domain type upgrade, data recovery etc.)
>>
>> These and other scenarios mean re-indexing the stored data. When the index
>> is huge, this is non-trivial (time-wise). Scenarios where user/group
>> access control changes multiple times in one day are not uncommon.
>>
>> There might be a way of storing acl data in a payload or similar, but I'm
>> not sure how that would work across millions of [arbitrarily grouped ]
>> documents (I'm not familiar enough with payloads to know if this would be a
>> good or bad idea).
>>
>>
>> >
>> > This is exactly why I think that we need to do the authentication
>> > upstream of the authority world.
>> >
>> >
>> Agreed.
>>
>>
>>
>> >
>> > If Solr handles arbitrary document metadata, then I think we could
>> > just use that feature.  But you know more about it than me, at this
>> > point.  It would be great to get an overview of potential ways of doing
>> > this.
>> >
>> >
>> Payloads, maybe?
>>
>>
>> >
>> > For your particular task, it sounds like you are trying to read from
>> > NTFS and apply security after-the-fact with some acl specification
>> > file.  In that case, I'd write a repository connector that was based
>> > on the file system connector (already part of the stable of connectors
>> > for LCF) which reads ACL information from your acl.xml file.  Or, if
>> > you prefer a UI for specifying ACL information, you could extend the
>> > connector so that security is configured in the UI without having an
>> > external acl.xml file at all - which would be a nice addition to the
>> > existing file system connector.  (Repository connections and jobs are
>> > configured internally in LCF by XML documents stored in the database,
>> > so they can be arbitrarily structured.  I'm happy to help you figure
>> > out how to do this if this is what you decide to do.)
>> >
>> For my particular requirements, there are no files - the data is
>> generated from the network and stored. After the fact, there is no
>> persistent location of this data other than in Solr.
>>
>> Storing the acl info using the connector sounds very interesting. Could be
>> worth looking at in more detail. Thanks!
>>
>>
>>
>> > I think we still need to add in the authentication piece to make this
>> > all work for you, so perhaps you can describe how you expect a user to
>> > interact with your system, so I can understand your design issues.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>>
>>
>>
>>
>>
>>
>>
>>
>> > -----Original Message-----
>> > From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> > Sent: Thursday, April 22, 2010 11:32 AM
>> > To: dev@lucene.apache.org
>> > Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> > connectors-dev@incubator.apache.org
>> > Subject: Re: FW: Solr and LCF security at query time
>> >
>> > Hi Karl,
>> >
>> > Thanks very much for your detailed explanation - really good!
>> >
>> > As I've thought through some of the implications, I've added comments
>> > below, so I hope they don't seem too jumbled...
>> >
>> > I suppose on the 'authority' side, it works kind of as I envisioned it
>> > would.
>> >
>> > For general Solr access control, there are two layers of security that
>> > need to be addressed:
>> >  1. Authentication - make sure the incoming query is from a valid
>> >     user, and the passed-in credentials (hash, certificate etc.) are
>> >     correct
>> >  2. Query filtering - potentially reduce the number/type of
>> >     returned results based on the allow/deny metadata for the
>> >     authenticated user
>> >
>> > I can see how the LCF auth connector works for 2., but can it do 1. as
>> > well?
>> > It would be good if this could somehow be integrated into any
>> > container (Tomcat/Jetty et al) authentication that might be configured
>> > (probably related to your previous post). In many ways, it could/should
>> > be that the Authority (AD) part of the connector should only be
>> > concerned with 1. and not 2. (see below).
>> >
>> > So, on the repository side, there is also an LCF connector that
>> > 'closes the loop' to provide the 'what is it I'm trying to control'
>> > side of things.
>> > I understand that LCF doesn't do the mapping - it delegates this task
>> > to the caller, but provides both sides of the equation (authority &
>> > repository).
>> >
>> > >>>>>
>> > - Each file in DirectoryA will have the following
>> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890",
>> > and "myAD:S-23-64-12345".
>> > - Each file in DirectoryB will have the following
>> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
>> > <<<<<
>> > I think this is the bit that is worrying me - is this storing the SIDs
>> > into Solr at document index time? This would be a problem for a whole
>> > load of reasons, but maybe I'm missing something here? (see below for
>> > a possible alternative)
>> >
>> > Basically, what I'm getting at here is that the allow/deny values need
>> > to be stored in one of three places:
>> >  1. In the authority (e.g. inside AD)
>> >  2. In the document metadata (index-time)
>> >  3. In external storage (e.g. acl.xml, NTFS etc.)
>> >
>> > 1. Extending AD is pretty much out, as this causes too many interop
>> >    problems
>> > 2. 'Hard-coding' acl information in the index makes it
>> >    non-portable, resistant to changes, etc.
>> > 3. acl.xml is coupled with a Solr instance, but is easily
>> >    ported/replicated.
>> >    Storing/retrieving acl information from the source (e.g. NTFS) is
>> >    problematic, as the source may not be accessible (it may not even
>> >    exist).
>> >
>> > I believe 3. or a variant is the way to go on the repo side, which
>> > means the LCF Authority connector is mainly for Authentication (see
>> > above), which is what you want from AD et al integration.
>> > The problem that arises from 'pluggable' authentication is that, if
>> > you're not using a certificate, you have to start with a password, but
>> > the connector only has access to the password hash (unless the pwd is
>> > sent in the query url). I don't know of a way to confirm identities in
>> > AD using only the username and hash (AD does the hash compare). I
>> > believe this is where container-based integration will likely work
>> > better.
>> >
>> > So that I can confirm my understanding...a scenario might be like this:
>> >
>> > We have an AD connector that fetches the SIDs and we can read them etc.
>> > For my environment, where there are no 'files' (there's only a
>> > transient network stream), we have an LCF 'Solr Field Filter Query'
>> > connector that decides which Filter Queries to apply (allow and deny)
>> > for the passed in SID(s).
>> >
>> > For another environment, let's say, NTFS, there might be an 'NTFS'
>> > connector that would provide some kind of mapping of files/folders to
>> > SID(s). Since Solr wouldn't intrinsically know about this, the acl
>> > information would need to be stored somewhere in the index. This would
>> > mean extending the Solr schema and storing metadata at index time.
>> > The alternative is to re-use the 'Solr Field Filter Query' connector
>> > for this as well (and any other document types that might be read in).
>> > This keeps the index 'clean' of acl-specific metadata, and allows for
>> > in-place changes and easy cross-document/index/instance access control.
>> >
>> >
>> > If the above interpretation is [roughly] correct (please let me know
>> > if I've got this wrong!), this would reduce down to having:
>> >   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
>> > (possibly/partly at the container level)
>> >   2. At least an LCF Repository connector for 'acl.xml'
>> >   3. Optional other LCF Repository connectors
>> >
>> > It sounds like you've now finished the first half of 1. by adding the
>> > ability to get the required auth data from a Solr api call. The other
>> > half of 1. will be implementing the LCF interface in the
>> > SolrACLSecurity class, to effectively replace the 'user', 'group' and
>> > 'password' bits of acl.xml.
>> >
>> > Does the above sound like an accurate interpretation? Just trying to
>> > get a good picture of what work needs doing, where it goes, etc.
>> >
>> > Many thanks!
>> > Peter
>> >
>> >
>> >
>> >
>> > On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>> >
>> > >  >>>>>>
>> > > What is the relationship between stored data (documents) and
>> > > authorities' access/deny attributes? (do you have any examples of
>> > > what an access_token value might contain?)
>> > > <<<<<<
>> > >
>> > > Documents have access/deny attributes; authorities simply provide
>> > > the list of tokens that belong to an authenticated user.  Thus,
>> > > there's no access/deny for an authority; that's attached to the
>> > > document (as it is in real-world repositories).
>> > >
>> > > Let's run a quick example, using Active Directory and a Windows file
>> > > system.  Suppose that you have a directory with documents in it,
>> > > call it DirectoryA, and the directory allows read access to the
>> > > following SIDs:
>> > >
>> > > S-123-456-76890
>> > > S-23-64-12345
>> > >
>> > > These SIDs correspond to active directory groups, let's call them
>> > > Group1 and Group2, respectively.
>> > >
>> > > DirectoryB also has documents in it, and those documents have just
>> > > the SID S-123-456-76890 attached, because only Group1 can read its
>> > > contents.
>> > >
>> > > Now, pretend that someone has created an LCF Active Directory
>> > > authority connection (in the LCF UI), which is called "myAD", and
>> > > this connection is set up to talk to the governing AD domain
>> > > controller for this Windows file system.  We now know enough to
>> > > describe the document indexing process:
>> > >
>> > > - Each file in DirectoryA will have the following
>> > > __ALLOW_TOKEN__document attributes inside Solr:
>> > > "myAD:S-123-456-76890" and "myAD:S-23-64-12345".
>> > > - Each file in DirectoryB will have the following
>> > > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
>> > >
>> > > Now, suppose that a user (let's call him "Peter") is authenticated
>> > > with the AD domain controller.  Peter belongs to Group2, so his SIDs
>> > > are (say):
>> > >
>> > > S-1-1-0 (the 'everyone' SID)
>> > > S-323-999-12345 (his own personal user SID)
>> > > S-23-64-12345 (the SID he gets because he belongs to group 2)
>> > >
>> > > We want to look up the documents in the search index that he can see.
>> > > So, we ask the LCF authority service what his tokens are, and we get
>> > > back:
>> > >
>> > > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
>> > >
>> > > The documents we should return in his search are the ones matching
>> > > his search criteria, PLUS the intersection of his tokens with the
>> > > document ALLOW tokens, MINUS the intersection of his tokens with the
>> > > document DENY tokens (there aren't any involved in this example).
>> > > So only files that have one of his three tokens as an ALLOW
>> > > attribute would be returned.
>> > >
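
The matching rule in this example can be sketched in a few lines of Python.
This is only an illustration of the rule as described above, not LCF or Solr
code: the file names are hypothetical, the search criteria themselves are
left out, and the index is modelled as a plain dictionary of allow and deny
token sets.

```python
def visible_documents(user_tokens, documents):
    """Return the names of the documents the user may see.

    A document is visible when at least one of the user's tokens appears in
    its allow set and none of them appears in its deny set.
    """
    user_tokens = set(user_tokens)
    return [
        name
        for name, acl in documents.items()
        if user_tokens & acl["allow"] and not user_tokens & acl["deny"]
    ]


# Index contents from the example; the file names are made up.
documents = {
    "DirectoryA/report.txt": {
        "allow": {"myAD:S-123-456-76890", "myAD:S-23-64-12345"},
        "deny": set(),
    },
    "DirectoryB/plan.txt": {
        "allow": {"myAD:S-123-456-76890"},
        "deny": set(),
    },
}

# Tokens the authority service returns for Peter in the example.
peter_tokens = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]

visible = visible_documents(peter_tokens, documents)
print(visible)  # ['DirectoryA/report.txt']
```

Peter sees the DirectoryA file through his Group2 token and nothing from
DirectoryB, which only allows Group1.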
>> > > Note that what we are attempting to do is enforce AD's security with
>> > > the search results we present.  There is no need to define a whole
>> > > new security mechanism, because AD already has one that people use.
>> > >
>> > > >>>>>>
>> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
>> > > to ensure there are no security or other dependencies of indexed
>> > > data with any external repository - most notably the file system.
>> > > There are many reasons for wanting this, but one of the main ones is
>> > > that Solr-stored data is not always based on file data (or
>> > > accessible file data).
>> > > In fact, in my particular case, almost none of the indexed data
>> > > comes from files.
>> > > <<<<<<
>> > >
>> > > LCF is all about abstracting from repositories.  It's not
>> > > specifically about a file system, although that is a convenient
>> > > example.  If you are building your own kind of repository with your
>> > > own security setup, that's fine - but in the LCF world you'd need to
>> > > create an authority connector for your repository (which maybe reads
>> > > your acl.xml file), as well as a repository connector (which hands
>> > > documents to LCF and provides it with the access tokens that make
>> > > security work).  Of course, you can do something much lighter that
>> > > doesn't include LCF at all if you are just integrating a custom
>> > > repository of your own, but it sounded like you were interested in
>> > > the broader problem here.
>> > >
>> > > So, LCF doesn't do "acl mapping" at all.  It relies on its various
>> > > connectors to work cooperatively to define access tokens in a way
>> > > that is consistent from authority connector to repository connector
>> > > for a given repository kind.  Anybody can write a connector, so the
>> > > beauty of all this is that you can build a system where data from
>> > > many disparate sources is indexed, and security for each is
>> > > simultaneously enforced.
>> > >
>> > > Karl
>> > >
>> > >
>> > >  ------------------------------
>> > > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> > > *Sent:* Thursday, April 22, 2010 9:24 AM
>> > >
>> > > *To:* dev@lucene.apache.org
>> > > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> > > connectors-dev@incubator.apache.org
>> > > *Subject:* Re: FW: Solr and LCF security at query time
>> > >
>> > > Hi Karl,
>> > >
>> > > Thanks very much for the diagram -
>> > > Sorry about all the questions, but this raises a few new ones...
>> > >
>> > > What is the relationship between stored data (documents) and
>> > > authorities' access/deny attributes? (Do you have any examples of what an
>> > > access_token value might contain?)
>> > >
>> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
>> > > to ensure there are no security or other dependencies of indexed
>> > > data with any external repository - most notably the file system.
>> > > There are many reasons for wanting this, but one of the main ones is
>> > > that Solr-stored data is not always based on file data (or
>> > > accessible file data).
>> > > In fact, in my particular case, almost none of the indexed data
>> > > comes from files.
>> > >
>> > > This is one reason why SOLR-1872 uses filter queries for its
>> > > access/deny tokens - so that all the required information for access
>> > > control completely resides within the Solr index itself.
>> > > Is the LCF architecture acl 'mapping' between Solr fields (queries)
>> > > and users, some external 'repository' (files) and users, or
>> > > arbitrary data (e.g. either of these)?
>> > >
>> > > I hope that makes sense...
>> > >
>> > > Thanks!
>> > > Peter
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
>> > >
>> > >> Hi Peter,
>> > >>
>> > >> I've attached a diagram that is not in the wiki as of yet, and I'll
>> > >> try to answer your questions.
>> > >>
>> > >> >>>>>>
>> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
>> > >> stored for a particular user in the underlying acl store (e.g.
>> > >> Active Directory)?
>> > >> How does AD and/or LCF handle storing such data in its schema?
>> > >> (Does AD need its schema extended?) Presumably, any such AD fields
>> > >> would need to be queried for effective rights in order to cater for
>> > >> group membership allows and denies.
>> > >> <<<<<<
>> > >>
>> > >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
>> > >> strings that represent a contract between an LCF authority
>> > >> connection and the LCF repository connection that picks up the
>> > >> documents (from wherever).
>> > >> These tokens thus have no real meaning outside of LCF.  You must
>> > >> regard them as opaque.
>> > >>
>> > >> The contract, however, states that if you use the LCF authority
>> > >> service to obtain tokens for an authenticated user, you will get
>> > >> back a set that is CONSISTENT with the tokens that were attached to
>> > >> the documents LCF sent to Solr for indexing in the first place.
>> > >> So, you don't have to worry about it, and that's kind of the idea.
>> > >> So you can imagine the following flow:
>> > >>
>> > >> (1) Use LCF to fetch documents and send them to Solr
>> > >> (2) When searching, use the LCF authority service to get the
>> > >> desired user's access tokens
>> > >> (3) Either filter the results, or modify the query, to be sure the
>> > >> access tokens all match up properly
>> > >>
>> > >> For the AD authority, the LCF access tokens consist, in part, of
>> > >> the user's SIDs.  For other authorities, the access tokens are
>> > >> wildly different.
>> > >> You really don't want to know what's in them, since that's the job
>> > >> of the LCF authority to determine. ;-)
>> > >>
>> > >> LCF is not, by the way, joined at the hip with AD.  However, in
>> > >> practice, most enterprises in the world use some form of AD single
>> > >> signon for their web applications, and even if they're using some
>> > >> repository with its own idea of security, there's a mapping between
>> > >> the AD users and the repository's users.  Doing that mapping is
>> > >> also the job of the LCF authority for that repository.
>> > >>
>> > >> Hope this helps.  Also, I'm not expecting time miracles here, so
>> > >> don't sweat the schedule.
>> > >>
>> > >>
>> > >> Karl
>> > >>
>> > >>
>> > >> ________________________________________
>> > >> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> > >> Sent: Thursday, April 22, 2010 4:27 AM
>> > >> To: dev@lucene.apache.org
>> > >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> > >> connectors-dev@incubator.apache.org
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> Hi Karl,
>> > >>
>> > >> Thanks for the quick turnaround.
>> > >> I'm in the middle of a product release for us, so I fear I won't be
>> > >> as quick as you... :-)
>> > >>
>> > >> I couldn't find a simple flow diagram or similar for LCF with
>> > >> regards security (probably looking in the wrong place).
>> > >> Perhaps you could help on these questions...?
>> > >>
>> > >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
>> > >> sub-queries, which are then used as filter queries in a user's
>> search.
>> > >>
>> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
>> > >> stored for a particular user in the underlying acl store (e.g.
>> > >> Active Directory)?
>> > >> How does AD and/or LCF handle storing such data in its schema?
>> > >> (Does AD need its schema extended?) Presumably, any such AD fields
>> > >> would need to be queried for effective rights in order to cater for
>> > >> group membership allows and denies.
>> > >>
>> > >> I guess I'm just trying to understand the architectural
>> > >> flow/storage/retrieval of data in the various parts of the system,
>> > >> but I admit, I need to do more research on this.
>> > >> After our product release, when I get a few more spare cycles, I
>> > >> can look at it in more detail.
>> > >>
>> > >> Many thanks!
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> Hi Peter,
>> > >>
>> > >> I just committed the promised changes to the LCF Solr output
>> connector.
>> > >>
>> > >> ACL metadata will now be posted to the Solr Http interface along
>> > >> with the document as the two following fields:
>> > >>
>> > >> __ACCESS_TOKEN__document
>> > >> __DENY_TOKEN__document
>> > >>
>> > >> There will, of course, potentially be multiple values for each of
>> > >> these two fields.
>> > >>
>> > >> Hope this helps,
>> > >> Karl
>> > >>
>> > >> ________________________________
>> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> > >> peter.sturge@googlemail.com>]
>> > >> Sent: Tuesday, April 20, 2010 6:51 PM
>> > >>
>> > >> To: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> Hi Karl,
>> > >>
>> > >> Thanks for the info. I'll have a look at the link and try to take
>> > >> in as much sugar as my insulin levels will handle...
>> > >> It sounds like the necessary interface(s) are already in LCF - just
>> > >> a matter of implementing them in the Solr 1872 plugin.
>> > >> I'll need to digest the LCF stuff to get to grips with it..please
>> > >> bear with me while I do that...
>> > >>
>> > >> When you say:
>> > >>   The LCF solr output connection doesn't yet do this, but it is
>> > >> trivial for me to make that happen.
>> > >> Do you mean a mechanism by which solr.war can get url et al info
>> > >> from its parent container (Tomcat, Jetty etc.), or have I
>> > >> misinterpreted
>> > this?
>> > >>
>> > >>
>> > >> Thanks,
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> Hi Peter,
>> > >>
>> > >> I'm the principal committer for LCF, but I don't know as much about
>> > >> Solr as I ought to, so it sounds like a potentially productive
>> > collaboration.
>> > >>
>> > >> LCF does exactly what you are looking for - the only issue at all
>> > >> is that you need to fetch a URL from a webapp to get what you are
>> > >> looking for.  The "plugs" are all inside LCF for different kinds of
>> > >> repositories.  Here's a link that might help with drinking the LCF
>> > >> "koolaid", as it were:
>> > >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>> > >>
>> > >> The url would be something like this (on a locally installed
>> > >> tomcat-based LCF instance):
>> > >>
>> > >>
>> > >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>> > >>
>> > >> ... and this fetch returns something like:
>> > >>
>> > >> TOKEN:xxxxxxx
>> > >> TOKEN:yyyyyyy
>> > >> TOKEN:zzzzzzz
>> > >> ....
>> > >>
>> > >> ... which represent the amalgamated tokens for all of the defined
>> > >> authorities, and by some strange coincidence ( ;-) ) are compatible
>> > >> with certain pieces of metadata that have been passed into Solr
>> > >> with each document - one set of Allow tokens, and a second set of
>> > >> Deny tokens.  The LCF solr output connection doesn't yet do this,
>> > >> but it is trivial for me to make that happen.
>> > >>
>> > >> Does this sound plausible to you?
>> > >>
>> > >> Karl
>> > >>
>> > >>
>> > >> ________________________________
>> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> > >> peter.sturge@googlemail.com>]
>> > >> Sent: Tuesday, April 20, 2010 5:41 PM
>> > >> To: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
>> > >> dev@lucene.apache.org>
>> > >>
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> Hi Karl,
>> > >>
>> > >> Integrating LCF to get external token support for SOLR-1872 sounds
>> > >> very interesting indeed. I don't know anything about LCF, but one
>> > >> of the things I was planning for SOLR-1872 is to make acl.xml (or
>> > >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
>> > >> series of plugins that could be used for obtaining back-end
>> > >> authentication
>> > information.
>> > >>
>> > >> If you're good with LCF, perhaps we could work together to build
>> > >> this
>> > in.
>> > >> One of the first things would be defining an interface that would
>> > >> be as easy as possible to plug LCF into. Have you any
>> > >> suggestions/insight on this front?
>> > >>
>> > >> Many thanks,
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> SOLR-1872 looks exactly like what I was envisioning, from the
>> > >> search query perspective, although instead of the acl xml file you
>> > >> specify LCF stipulates you would dynamically query the
>> > >> lcf-authority-service servlet for the access tokens themselves.
>> > >> That would get you support for AD, Documentum, LiveLink, Meridio,
>> > >> and Memex for free. It seems likely that this component could be
>> > >> modified to work with LCF with minor
>> > effort.
>> > >>
>> > >> The missing component still seems to be AD authentication, which
>> > >> needs a solution.
>> > >>
>> > >> Karl
>> > >>
>> > >> ________________________________
>> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> > >> peter.sturge@googlemail.com>]
>> > >> Sent: Tuesday, April 20, 2010 10:44 AM
>> > >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> > >> Subject: Re: FW: Solr and LCF security at query time
>> > >>
>> > >> If you want to do this completely within Solr, have a look at:
>> > >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>> > >>
>> > >> Thanks,
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
>> > >> karl.wright@nokia.com>> wrote:
>> > >> FYI
>> > >>
>> > >> ________________________________
>> > >> From: Wright Karl (Nokia-S/Cambridge)
>> > >> Sent: Tuesday, April 20, 2010 8:16 AM
>> > >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
>> > >> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
>> > >> connectors-dev@incubator.apache.org<mailto:
>> > >> connectors-dev@incubator.apache.org>'; '
>> > >> connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>'
>> > >> Subject: RE: Solr and LCF security at query time
>> > >>
>> > >> Dominique,
>> > >>
>> > >> Yes, I am aware of this ticket and contribution.  Luckily LCF
>> > >> establishes a powerful multi-repository security model, even though
>> > >> it doesn't yet do the final step of enforcing that model at the
>> > >> search end.  LCF allows you to define multiple authorities to
>> > >> operate against disparate repositories, and use the appropriate
>> > >> authority to secure any given document.  The solr people are aware
>> > >> of this design, which addresses the issues raised by SOLR-1834 very
>> > >> nicely.  However, as I said before, time is a problem, and the work
>> > >> still needs to be
>> > done.
>> > >>
>> > >> I suggest you read up on the actual security model of LCF, and
>> > >> perhaps experiment with that and the SOLR-1834 contribution, to see
>> > >> if there is common ground.  One thing we've learned at MetaCarta is
>> > >> that post-filtering for security purposes is expensive, and it is
>> > >> better to modify the queries themselves to restrict the results, if
>> > >> possible.  I'm not sure which approach SOLR-1834 takes, although it
>> > >> sounds like it might be the filtering approach.  Still, it would be
>> > better than nothing.
>> > >>
>> > >> Please let me know what you find out.
>> > >>
>> > >> Thanks,
>> > >> Karl
>> > >>
>> > >> ________________________________
>> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
>> > >> dominique.bejean@eolya.fr>]
>> > >> Sent: Tuesday, April 20, 2010 8:03 AM
>> > >> To: Wright Karl (Nokia-S/Cambridge)
>> > >> Cc: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>;
>> > >> connectors-dev@incubator.apache.org<mailto:
>> > >> connectors-dev@incubator.apache.org>
>> > >> Subject: Re: Solr and LCF security at query time
>> > >>
>> > >> Karl,
>> > >>
>> > >> Thank you for your reply.
>> > >>
>> > >> I did some research today and found this:
>> > >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> > >> http://demo.findwise.se:8880/SolrSecurity/
>> > >>
>> > >> The Solr security model has to be able to filter a result list with
>> > >> items coming from various sources at the same time (livelink,
>> > >> documentum, file system, ...). Big subject :)
>> > >>
>> > >> Dominique
>> > >>
>> > >>
>> > >> On 20/04/10 13:34,
>> > >> karl.wright@nokia.com<ma...@nokia.com> wrote:
>> > >> Hi Dominique,
>> > >>
>> > >> At the moment, in order to enforce the LCF security model within
>> > >> Lucene/Solr, you will need to build this functionality into
>> > >> whatever client you are using to display the Lucene search results.
>> > >> Specifically, you would need to take the following steps:
>> > >>
>> > >> (1) Have your users access your search client through Apache.
>> > >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> > >> mod_authz_annotate, to cause authorization HTTP headers to be
>> > >> transmitted to the client webapp.
>> > >> (3) Have your client webapp alter whatever queries it is doing, to
>> > >> add an appropriate query clause for each of the access tokens
>> > >> transmitted in the headers.
>> > >>
>> > >> (This is how it is done at MetaCarta.)
>> > >>
>> > >> Alternatively, you may find a way to do this completely with a web
>> > >> application under a Java app server such as Tomcat.  I have not yet
>> > >> done the research to find out whether this is a feasible alternative.
>> > >> Effectively, what you need something like mod_auth_kerb to do is to
>> > >> authenticate your user against Active Directory, or whomever the
>> > authenticator ought to be.
>> > >>  JAAS may be helpful here.
>> > >>
>> > >> There are, of course, intentions to fill out the missing pieces
>> > >> more completely and transparently via a Solr search plugin and/or
>> filter.
>> > >> What has been lacking is time.  If you are in a position to do
>> > >> development in this area, we're happy to have any assistance you
>> > >> might
>> > provide.
>> > >>
>> > >> Thanks,
>> > >> Karl
>> > >> ________________________________
>> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> > >> Sent: Tuesday, April 20, 2010 5:06 AM
>> > >> To: connectors-user@incubator.apache.org<mailto:
>> > >> connectors-user@incubator.apache.org>
>> > >>  Subject: Solr and LCF security at query time
>> > >>
>> > >> Hi,
>> > >>
>> > >> I don't see in the LCF wiki how Solr and LCF work together at query
>> > >> time in order to remove from the result list the items the user is
>> > >> not allowed to access.
>> > >>
>> > >> In
>> > >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> > >> I just see these sentences:
>> > >>
>> > >> " Once all these documents and their access tokens are handed to
>> > >> the search engine, it is the search engine's job to enforce
>> > >> security by excluding inappropriate documents from the search
>> > >> results. For Lucene, this infrastructure is expected to be built on
>> > >> top of Lucene's generic metadata abilities, but has not been
>> > >> implemented at
>> > this time."
>> > >>
>> > >> I am not sure I understand. Does this mean that, for the moment, it
>> > >> is not possible for Solr to apply security by using an Authority
>> > >> Connector?
>> > >>
>> > >> Dominique
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> -------------------------------------------------------------------
>> > >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
>> > >> additional commands, e-mail: dev-help@lucene.apache.org
>> > >>
>> > >
>> > >
>> >
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
>>>>>>
With regard to schema extension, I believe we need to be very careful here, as requiring index-time storage of access control data will pose a problem for any use cases where the access control needs to change (maybe often, maybe only occasionally). I'm trying to think of a use case where this wouldn't at least potentially be the case, and I can't think of one, but perhaps I'm not truly understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in (e.g. let's say someone has left my organization, do I have to update documents to remove his/her access?).
<<<<<<

Usually the way this works is that the user's account is locked out so they can't log in.  The authority service picks up this change, and it therefore takes place immediately.
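
To make the point above concrete, here is a minimal Python sketch of the model. It is only an illustration under stated assumptions: the in-memory index, the AUTHORITY table, and the function names are hypothetical stand-ins for Solr and the LCF authority service, and the rule "at least one allow token and no deny token" is an assumed reading of the allow/deny semantics. The token values are the example SIDs quoted elsewhere in this thread. The tokens attached to documents never change in the sketch; only the answer from the authority lookup does.

```python
# Toy model: documents carry opaque allow/deny tokens fixed at index time.
# The user's token set is looked up at query time (hypothetical stand-in
# for a call to the LCF authority service).

INDEX = [
    {"id": "doc1", "allow": ["myAD:S-123-456-76890"], "deny": []},
    {"id": "doc2", "allow": ["myAD:S-23-64-12345"],
     "deny": ["myAD:S-123-456-76890"]},
]

# Hypothetical authority data: user name -> tokens currently held.
AUTHORITY = {
    "alice": {"myAD:S-123-456-76890"},
    "bob": {"myAD:S-23-64-12345"},
}


def tokens_for(user, locked_out=frozenset()):
    """Return the user's tokens; a locked-out account gets none."""
    if user in locked_out:
        return set()
    return AUTHORITY.get(user, set())


def is_visible(doc, user_tokens):
    """Assumed rule: at least one allow token matches, no deny token does."""
    allowed = any(t in user_tokens for t in doc["allow"])
    denied = any(t in user_tokens for t in doc["deny"])
    return allowed and not denied


def search(user, locked_out=frozenset()):
    """Return the ids of documents the user may see right now."""
    tokens = tokens_for(user, locked_out)
    return [d["id"] for d in INDEX if is_visible(d, tokens)]
```

Calling search("alice", locked_out={"alice"}) returns an empty list even though INDEX was never modified, which is the sense in which the change "takes place immediately".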

Bear in mind that this particular model has been employed by MetaCarta for more than five years in the field with clients such as pretty near all the major oil companies, many U.S. government agencies, the U.S. military, etc.  In that time we have not heard even one complaint about the security model.

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 7:18 AM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses, although the point at which these are done is slightly different.

I think I am correct in my understanding that with filter queries, the results are filtered 'post-Lucene', but are separately (Solr) cached, so you get a hit on the first search, but then benefit from cached hits on subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are applied at the Lucene query directly, so don't have separate Solr caching. I've not benchmarked the two, so one or other might be slower/faster for various search scenarios.

In any case, I believe either technique can be employed in either 1834 or 1872.
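
As a rough illustration of the two techniques, the Python sketch below builds the request parameters for each. It assumes only Solr's standard `q` and `fq` request parameters and the standard `+clause` (required) query syntax; the helper names and the sample restriction are made up for the example, and it says nothing about which variant performs better.

```python
from urllib.parse import urlencode


def with_filter_query(user_query, restriction):
    """Keep the user's query as-is and pass the restriction as a
    separate filter query (fq) parameter."""
    return urlencode([("q", user_query), ("fq", restriction)])


def with_rewritten_query(user_query, restriction):
    """Fold the restriction into the main query as a second required
    clause, so both must match."""
    combined = "+(%s) +(%s)" % (user_query, restriction)
    return urlencode([("q", combined)])


if __name__ == "__main__":
    acl = '__ALLOW_TOKEN__document:"myAD:S-123-456-76890"'
    print(with_filter_query("title:report", acl))
    print(with_rewritten_query("title:report", acl))
```

Either string can be appended to a select URL; the difference is only in where the restriction is attached to the request.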


With regard to schema extension, I believe we need to be very careful here, as requiring index-time storage of access control data will pose a problem for any use cases where the access control needs to change (maybe often, maybe only occasionally). I'm trying to think of a use case where this wouldn't at least potentially be the case, and I can't think of one, but perhaps I'm not truly understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in (e.g. let's say someone has left my organization, do I have to update documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e. Would the same type/format of tokens be used for data from different sources (e.g. NTFS files, network streams, NFS, web pages, etc.). Will the tokens be compatible with multiple and/or changing authorities (e.g. AD, documentum, LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the acl data. I admit, I've not had enough time to look into how this might look at the moment, but it sounds like it could be a good way to hold generic (authority-agnostic) acl data, and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism, please correct me. Would the __ALLOW_TOKEN__ et al fields store, for example, SID information?


Thanks,
Peter



On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com>> wrote:
Ok, not hearing back from Peter, I've done some Solr research and written some code that might work.  The approach I've taken is most similar to SOLR 1834, other than the LCF-centric logic.  Hopefully there will be a chance to try this out in a full end-to-end way  on the weekend, after which I will submit it to the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define __ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be string, multivalued fields (I think).  It would be great if these could be made a default part of Solr; similarly, it would be good if the new search component was predelivered with Solr and mentioned (even if commented out) in the example solrconfig.xml file.  The only other thing that needs to be done to hook up the search component is to include a configuration parameter describing the base URL of the LCF authority service.  Plus, as I said earlier, we still don't have a canned solution for authentication yet - although I feel that will be straightforward.
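
A minimal Python sketch of what such a component would have to do with the authority service is given below. It is not the code mentioned above. It assumes the UserACLs URL shape and the "TOKEN:" response lines shown earlier in this thread, plus the field names from the previous paragraph; the function names, the quoting helper, and the way allow and deny clauses are combined are assumptions made for illustration. How the document-level and share-level clauses should be combined is deliberately left out.

```python
from urllib.parse import urlencode


def authority_url(base_url, username):
    """Build the authority-service URL for one user (URL shape taken
    from the example earlier in the thread)."""
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(response_text):
    """Extract token values from 'TOKEN:...' lines, ignoring other lines."""
    tokens = []
    for line in response_text.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


def quote(value):
    """Quote a token as a phrase, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def level_clause(level, tokens):
    """Build a restriction for one level ('document' or 'share').

    Assumed semantics: at least one allow token must match and no deny
    token may match. Returns None when the user has no tokens, so the
    caller can decide how to reject the search.
    """
    if not tokens:
        return None
    allow = " OR ".join(
        "__ALLOW_TOKEN__%s:%s" % (level, quote(t)) for t in tokens)
    deny = " ".join(
        "-__DENY_TOKEN__%s:%s" % (level, quote(t)) for t in tokens)
    return "+(%s) %s" % (allow, deny)
```

The base URL passed to authority_url corresponds to the configuration parameter described above; fetching the URL and attaching the resulting clause to the query are left to the search component.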

Comments welcome...
Karl


________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org<ma...@incubator.apache.org>; dev@lucene.apache.org<ma...@lucene.apache.org>
Cc: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; lucene-dev@apache.org<ma...@apache.org>
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions in detail, and have a couple of SOLR-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org<ma...@lucene.apache.org>
Cc: connectors-dev@incubator.apache.org<ma...@incubator.apache.org>; connectors-user@incubator.apache.org<ma...@incubator.apache.org>; lucene-dev@apache.org<ma...@apache.org>
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com>> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
>  The reason for this is that it is almost never the case that
> somebody wants to provide multiple credentials in order to be able to see their results.
>  Most enterprises who have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you noted my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> Sun's Krb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Erik Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
 * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
 * What happens if you need to revoke or change a user's or group's access?
 * It's difficult to move/replicate the index to another domain
 * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). There are not uncommon scenarios where user/group access control can change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped ] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org<ma...@lucene.apache.org>
> Cc: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; lucene-dev@apache.org<ma...@apache.org>;
> connectors-dev@incubator.apache.org<ma...@incubator.apache.org>
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible
> alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890" and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, PLUS the intersection of his tokens with the
> > document ALLOW tokens, MINUS the intersection of his tokens with the
> > document DENY tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org<ma...@lucene.apache.org>
> > *Cc:* connectors-user@incubator.apache.org<ma...@incubator.apache.org>; lucene-dev@apache.org<ma...@apache.org>;
> > connectors-dev@incubator.apache.org<ma...@incubator.apache.org>
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary
> data (e.g.
> > either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active
> Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from
> wherever).
> >>  These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you
> imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly
> different.
> >>  You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com<ma...@googlemail.com>]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
> >> Cc: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; lucene-dev@apache.org<ma...@apache.org>;
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active
> Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com><mailto:
> >> karl.wright@nokia.com<ma...@nokia.com>>> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com><mailto:
> >> peter.sturge@googlemail.com<ma...@googlemail.com>>]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >>
> >> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted
> this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com><mailto:
> >> karl.wright@nokia.com<ma...@nokia.com>>> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com><mailto:
> >> peter.sturge@googlemail.com<ma...@googlemail.com>>]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>; dev@lucene.apache.org<ma...@lucene.apache.org><mailto:
> >> dev@lucene.apache.org<ma...@lucene.apache.org>>
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication
> information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this
> in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com><mailto:
> >> karl.wright@nokia.com<ma...@nokia.com>>> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify LCF stipulates you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor
> effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com><mailto:
> >> peter.sturge@googlemail.com<ma...@googlemail.com>>]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org<ma...@lucene.apache.org>>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com><mailto:
> >> karl.wright@nokia.com<ma...@nokia.com>>> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>>'
> >> Cc: 'solr-dev@apache.org<ma...@apache.org>>'; '
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org>>'; '
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be
> done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<ma...@eolya.fr><mailto:
> >> dominique.bejean@eolya.fr<ma...@eolya.fr>>]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>;
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org>>
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and I found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> authenticator ought to be.
> >>  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might
> provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<ma...@eolya.fr>]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>
> >>  Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences :
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at
> this time."
> >>
> >> I am not sure I understand. Does this mean that, for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> -------------------------------------------------------------------
> >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org> For
> >> additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org> For
> additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org>
For additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>





RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
I've attached the code I wrote, but haven't yet tested.  Any comments?
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 7:18 AM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses, although the point at which these are applied is slightly different.

I think I am correct in my understanding that with filter queries, the results are filtered 'post-Lucene', but are separately (Solr) cached, so you get a hit on the first search, but then benefit from cached hits on subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are applied at the Lucene query directly, so don't have separate Solr caching. I've not benchmarked the two, so one or other might be slower/faster for various search scenarios.

In any case, I believe either technique can be employed in either 1834 or 1872.
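
As a rough illustration of the query-modification approach discussed here, the sketch below turns a user's access tokens into a single filter-query string in standard Lucene query syntax: one required clause on the allow field and one prohibited clause on the deny field. The function names are hypothetical and are not part of SOLR-1834, SOLR-1872, or LCF; the field names are the ones proposed in this thread, and refusing users with no tokens is an assumption about the desired behaviour.

```python
def quote_token(token: str) -> str:
    """Wrap a token in double quotes, escaping backslashes and quotes.

    Tokens such as "myAD:S-1-1-0" contain ':' characters, so they are sent
    as quoted phrases rather than bare terms.
    """
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_acl_filter(tokens,
                     allow_field="__ALLOW_TOKEN__document",
                     deny_field="__DENY_TOKEN__document"):
    """Build a filter query: at least one allow match, no deny match.

    Hypothetical helper. It only assembles a string and does not talk to Solr.
    """
    if not tokens:
        # Assumption: a user without tokens should not be searching at all.
        raise ValueError("user has no access tokens")
    alternatives = " OR ".join(quote_token(t) for t in tokens)
    # '+' marks the allow clause as required, '-' marks the deny clause
    # as prohibited.
    return f"+{allow_field}:({alternatives}) -{deny_field}:({alternatives})"


if __name__ == "__main__":
    print(build_acl_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"]))
```

The resulting string could be passed as an fq parameter or parsed into Boolean clauses; which of the two performs better is not something this sketch addresses.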


With regard to schema extension, I believe we need to be very careful here, as requiring index-time storage of access control data will pose a problem for any use cases where the access control needs to change (maybe often, maybe only occasionally). I'm trying to think of a use case where this wouldn't at least potentially be the case, and I can't think of one, but perhaps I'm not truly understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in (e.g. let's say someone has left my organization, do I have to update documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e., would the same type/format of tokens be used for data from different sources (e.g. NTFS files, network streams, NFS, web pages, etc.)? Will the tokens be compatible with multiple and/or changing authorities (e.g. AD, Documentum, LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the acl data. I admit, I've not had enough time to look into how this might look at the moment, but it sounds like it could be a good way to hold generic (authority-agnostic) acl data, and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism, please correct me. Would the __ALLOW_TOKEN__ et al fields store, for example, SID information?


Thanks,
Peter



On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:
Ok, not hearing back from Peter, I've done some Solr research and written some code that might work.  The approach I've taken is most similar to SOLR-1834, other than the LCF-centric logic.  Hopefully there will be a chance to try this out in a full end-to-end way on the weekend, after which I will submit it to the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define __ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be string, multivalued fields (I think).  It would be great if these could be made a default part of Solr; similarly, it would be good if the new search component was predelivered with Solr and mentioned (even if commented out) in the example solrconfig.xml file.  The only other thing that needs to be done to hook up the search component is to include a configuration parameter describing the base URL of the LCF authority service.  Plus, as I said earlier, we still don't have a canned solution for authentication - although I feel that will be straightforward.
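
To make the "base URL of the LCF authority service" parameter concrete, here is a minimal sketch of how a client might fetch and parse a user's tokens. It assumes the UserACLs URL shape and the line-oriented "TOKEN:..." response quoted earlier in this thread; the function names, the UTF-8 decoding, and the choice to ignore any other response lines are assumptions, not documented LCF behaviour.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def user_acls_url(base_url: str, username: str) -> str:
    """Build the UserACLs request URL from the configured base URL."""
    query = urlencode({"username": username})
    return f"{base_url.rstrip('/')}/UserACLs?{query}"


def parse_tokens(body: str) -> list:
    """Extract token values from lines of the form 'TOKEN:<value>'.

    Lines that do not start with 'TOKEN:' are ignored (an assumption).
    """
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


def fetch_tokens(base_url: str, username: str, timeout: float = 10.0) -> list:
    """Fetch and parse the tokens for one user (performs an HTTP GET)."""
    with urlopen(user_acls_url(base_url, username), timeout=timeout) as resp:
        return parse_tokens(resp.read().decode("utf-8"))


if __name__ == "__main__":
    sample = "TOKEN:myAD:S-1-1-0\nTOKEN:myAD:S-23-64-12345\n"
    print(parse_tokens(sample))
```

Keeping the parsing separate from the HTTP call means the token handling can be exercised without a running authority service.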

Comments welcome...
Karl


________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR-1872 and SOLR-1834 contributions in detail, and have a couple of Solr-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?
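
Whichever clause structure is used, the rule it has to implement is the one described earlier in this thread: a document is visible when the user holds at least one of its allow tokens and none of its deny tokens. The sketch below states that rule as a plain predicate, which could serve as a reference when checking that a given filter does not let through documents it should not. The function name is hypothetical; the example tokens are the ones from the Active Directory example in this thread.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if the user may see a document.

    Visible means: at least one user token appears among the document's
    allow tokens, and no user token appears among its deny tokens.
    """
    user = set(user_tokens)
    has_allow = bool(user & set(allow_tokens))
    has_deny = bool(user & set(deny_tokens))
    return has_allow and not has_deny


if __name__ == "__main__":
    peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]
    directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
    directory_b = ["myAD:S-123-456-76890"]
    print(is_visible(peter, directory_a))
    print(is_visible(peter, directory_b))
```

In the example, Peter belongs only to Group2, so files in DirectoryA match one of his tokens while files in DirectoryB match none.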

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
>  The reason for this is that it is almost never the case that
> somebody wants to provide multiple credentials in order to be able to see their results.
>  Most enterprises who have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you read my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> Sun's Krb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Erik Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
 * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
 * What happens if you need to revoke or change a user's or group's access?
 * It's difficult to move/replicate the index to another domain
 * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). It is not uncommon for user/group access control to change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped ] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>

For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more details. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there's two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible
> alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, PLUS the intersection of his tokens with the
> > document ALLOW tokens, MINUS the intersection of his tokens with the
> > document DENY tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be
> returned.
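
The DirectoryA/DirectoryB example above can be written out as a set intersection. This is a minimal Python sketch under the rule just described (visible if at least one ALLOW token matches and no DENY token matches); the function name can_see is hypothetical, and the token values are the illustrative ones from the example.

```python
def can_see(user_tokens, allow_tokens, deny_tokens=()):
    # Visible when the user shares an ALLOW token with the document
    # and shares no DENY token with it.
    user = set(user_tokens)
    allowed = bool(user & set(allow_tokens))
    denied = bool(user & set(deny_tokens))
    return allowed and not denied


# Tokens returned by the authority service for Peter (from the example).
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]

# __ALLOW_TOKEN__document values attached at index time (from the example).
directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
directory_b = ["myAD:S-123-456-76890"]

print(can_see(peter, directory_a))  # True: Group2's SID matches
print(can_see(peter, directory_b))  # False: only Group1 is allowed
```

In a real deployment the same test would be expressed as query clauses evaluated by the search engine rather than as a per-document check in client code.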
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).
> >>  These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you can imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.
> >>  You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >>
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
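
A caller of the authority service might build the request URL and read the reply along the following lines. This is a Python sketch only: the helper names are hypothetical, the base URL is the local example given above, and the only response format assumed is the one shown (one "TOKEN:" line per token); any other lines are simply ignored here.

```python
from urllib.parse import urlencode


def authority_url(base_url, username):
    # Build the UserACLs request URL, percent-encoding the user name.
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(body):
    # Collect the value of every line that starts with "TOKEN:".
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


url = authority_url("http://localhost:8080/lcf-authority-service",
                    "someusername@somedomain.com")
print(url)

# A reply shaped like the sample above.
sample_reply = "TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"
print(parse_tokens(sample_reply))
```

Fetching the URL itself (for example with urllib.request.urlopen) is left out so the sketch runs without a live LCF instance.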
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates that you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org';
> >> 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and I found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whoever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >>  Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that, for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
>
>




RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
>>>>>>
With regards schema extension, I believe we need to be very careful here, as requiring index-time storage of access control data will pose a problem for any use cases where the access control needs to change (maybe often, maybe only occasionally). I'm trying to think of a use case where this wouldn't at least potentially be the case, and I can't think of one, but perhaps I'm not truly understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in (e.g. let's say someone has left my organization, do I have to update documents to remove his/her access?).
<<<<<<

Usually the way this works is that the user's account is locked out so they can't log in.  The authority service picks up this change, and it therefore takes place immediately.

Bear in mind that this particular model has been employed by MetaCarta for more than five years in the field with clients such as pretty near all the major oil companies, many U.S. government agencies, the U.S. military, etc.  In that time we have not heard even one complaint about the security model.

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 7:18 AM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses, although the point at which this happens is slightly different.

I think I am correct in my understanding that with filter queries, the results are filtered 'post-Lucene', but are separately (Solr) cached, so you get a hit on the first search, but then benefit from cached hits on subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are applied at the Lucene query directly, so don't have separate Solr caching. I've not benchmarked the two, so one or other might be slower/faster for various search scenarios.
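
The two placements described above can be sketched as request parameters. This is a rough illustration only: q and fq are standard Solr request parameters, but the clause itself and the field names (taken from elsewhere in this thread) are assumptions, and neither SOLR-1834 nor SOLR-1872 necessarily builds its request this way.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical security clause for a single user token.
CLAUSE = ('+__ALLOW_TOKEN__document:"myAD:S-1-1-0" '
          '-__DENY_TOKEN__document:"myAD:S-1-1-0"')


def as_filter_query(user_query, clause=CLAUSE):
    """Leave the user's query in q and send the security clause as fq."""
    return urlencode([("q", user_query), ("fq", clause)])


def as_rewritten_query(user_query, clause=CLAUSE):
    """Fold the security clause into the main query as extra Boolean clauses."""
    return urlencode([("q", f"+({user_query}) {clause}")])


filtered = as_filter_query("oil wells")
rewritten = as_rewritten_query("oil wells")

# Decode again to show which parameter carries the security clause.
print(parse_qs(filtered))
print(parse_qs(rewritten))
```

In the first form the clause is a separate parameter, which is what lets Solr cache the filter independently of the user's query; in the second form it is part of the query handed to Lucene.
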

In any case, I believe either technique can be employed in either 1834 or 1872.


With regards schema extension, I believe we need to be very careful here, as requiring index-time storage of access control data will pose a problem for any use cases where the access control needs to change (maybe often, maybe only occasionally). I'm trying to think of a use case where this wouldn't at least potentially be the case, and I can't think of one, but perhaps I'm not truly understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in (e.g. let's say someone has left my organization, do I have to update documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e. Would the same type/format of tokens be used for data from different sources (e.g. NTFS files, network streams, NFS, web pages, etc.). Will the tokens be compatible with multiple and/or changing authorities (e.g. AD, documentum, LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the acl data. I admit, I've not had enough time to look into how this might look at the moment, but it sounds like it could be a good way to hold generic (authority-agnostic) acl data, and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism, please correct me. Would the __ALLOW_TOKEN__ et al fields store, for example, SID information?


Thanks,
Peter



On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:
Ok, not hearing back from Peter, I've done some Solr research and written some code that might work.  The approach I've taken is most similar to SOLR 1834, other than the LCF-centric logic.  Hopefully there will be a chance to try this out in a full end-to-end way  on the weekend, after which I will submit it to the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define __ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be string, multivalued fields (I think).  It would be great if these could be made a default part of Solr; similarly, it would be good if the new search component was predelivered with Solr and mentioned (even if commented out) in the example solrconfig.xml file.  The only other thing that needs to be done to hook up the search component is to include a configuration parameter describing the base URL of the LCF authority service.  Plus, as I said earlier, we still don't have a canned solution for authentication yet - although I feel that will be straightforward.
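
As a rough sketch of the static schema addition described above, the four field definitions could be generated like this. The name, type, and multiValued attributes follow the description ("string, multivalued"); indexed="true" and stored="false" are assumptions made for illustration, not something this thread specifies.

```python
import xml.etree.ElementTree as ET

# The four field names come from the message above.
FIELDS = [
    "__ALLOW_TOKEN__document",
    "__DENY_TOKEN__document",
    "__ALLOW_TOKEN__share",
    "__DENY_TOKEN__share",
]


def field_elements(names):
    """Build one <field> element per security field."""
    elements = []
    for name in names:
        elements.append(ET.Element("field", {
            "name": name,
            "type": "string",
            "indexed": "true",      # assumption: needed so the tokens can be queried
            "stored": "false",      # assumption: tokens need not be returned in results
            "multiValued": "true",
        }))
    return elements


lines = [ET.tostring(e, encoding="unicode") for e in field_elements(FIELDS)]
print("\n".join(lines))
```

The printed elements would go inside the fields section of the Solr schema; a dynamic field pattern would be the alternative mentioned above.
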

Comments welcome...
Karl


________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions in detail, and have a couple of SOLR-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just
> authorization.  The reason for this is that it is almost never the case
> that somebody wants to provide multiple credentials in order to be able
> to see their results.  Most enterprises that have multiple repositories
> authenticate against AD and then map AD user names to repository user
> names in order to access those repositories.  If you noted my earlier
> posts from this morning, you may have noted that I'm looking at
> recommending JAAS plus Sun's Krb5 login module for handling the
> "authenticate against AD" case, which would cover some 95%+ of the
> real-world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Eric Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here's just a few:
 * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
 * What happens if you need to revoke or change a user's or group's access?
 * It's difficult to move/replicate the index to another domain
 * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). It is not uncommon for user/group access control to change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped ] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. It could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
>  1. Extending AD is pretty much out, as this causes too many interop
> problems
>  2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
>  3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, restricted to those whose document ALLOW tokens
> > intersect his tokens, MINUS those whose document DENY tokens
> > intersect his tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
> >
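The rule in the example above can be written out as a small check. This is a minimal sketch using the SIDs and the "myAD" authority name from the example; the function name is made up for illustration and is not part of LCF or Solr.

```python
def visible(user_tokens, allow_tokens, deny_tokens=()):
    """A document is visible if the user holds at least one of its ALLOW
    tokens and none of its DENY tokens."""
    user = set(user_tokens)
    has_allow = bool(user & set(allow_tokens))
    has_deny = bool(user & set(deny_tokens))
    return has_allow and not has_deny


# Tokens the authority service returns for Peter in the example.
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]

# ALLOW tokens attached to the files in each directory at indexing time.
directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
directory_b = ["myAD:S-123-456-76890"]

print("DirectoryA visible:", visible(peter, directory_a))
print("DirectoryB visible:", visible(peter, directory_b))
```

Peter shares the Group2 token with DirectoryA, so its files match; he shares nothing with DirectoryB, so its files are excluded. A DENY token on a document would override a matching ALLOW token.
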
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).  These tokens thus have no real meaning
> >> outside of LCF.  You must regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.  You really don't want to know what's in them,
> >> since that's the job of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates that you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
>
>




RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
I've attached the code I wrote, but haven't yet tested.  Any comments?
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 7:18 AM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses, although the point at which this is done is slightly different.

I think I am correct in my understanding that with filter queries, the results are filtered 'post-Lucene', but are separately (Solr) cached, so you get a hit on the first search, but then benefit from cached hits on subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are applied at the Lucene query directly, so don't have separate Solr caching. I've not benchmarked the two, so one or other might be slower/faster for various search scenarios.

In any case, I believe either technique can be employed in either 1834 or 1872.
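
As a rough illustration of the query-modification approach discussed here, the sketch below builds a single Solr filter-query string from a user's access tokens. It is not the code from SOLR-1834, SOLR-1872, or the attachment mentioned in this thread. The function names are hypothetical, only the document-level fields are covered, and the field names are taken from the proposal quoted below. Tokens are double-quoted because they contain ':' and '-' characters.

```python
def quote_term(token: str) -> str:
    """Wrap a token in double quotes so characters like ':' are taken literally."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_filter_query(user_tokens,
                       allow_field="__ALLOW_TOKEN__document",
                       deny_field="__DENY_TOKEN__document"):
    """Return a filter query: must match an ALLOW token, must not match a DENY token.

    Hypothetical helper for illustration; the field names are assumptions
    based on the schema proposal in this thread.
    """
    if not user_tokens:
        # With no tokens the user can see nothing; let the caller decide how
        # to return an empty result instead of emitting an open query.
        raise ValueError("no access tokens supplied")
    terms = " OR ".join(quote_term(t) for t in user_tokens)
    return f"+{allow_field}:({terms}) -{deny_field}:({terms})"


if __name__ == "__main__":
    print(build_filter_query(["myAD:S-1-1-0", "myAD:S-23-64-12345"]))
```

Whether such a clause is passed as a Solr filter query (and so cached separately) or folded into the main Lucene query is exactly the trade-off raised above; the string-building step is the same either way.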


With regard to schema extension, I believe we need to be very careful here, as requiring index-time storage of access control data will pose a problem for any use cases where the access control needs to change (maybe often, maybe only occasionally). I'm trying to think of a use case where this wouldn't at least potentially be the case, and I can't think of one, but perhaps I'm not truly understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in (e.g. let's say someone has left my organization, do I have to update documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e. Would the same type/format of tokens be used for data from different sources (e.g. NTFS files, network streams, NFS, web pages, etc.). Will the tokens be compatible with multiple and/or changing authorities (e.g. AD, documentum, LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the acl data. I admit, I've not had enough time to look into how this might look at the moment, but it sounds like it could be a good way to hold generic (authority-agnostic) acl data, and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism, please correct me. Would the __ALLOW_TOKEN__ et al fields store, for example, SID information?
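
For reference, Karl's DirectoryA/DirectoryB example (quoted further down in this thread) can be worked through as a small sketch. The token values are the ones from that example; the function name and the set-based check are only an illustration of the allow/deny rule he describes, not LCF or Solr code.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """A document is visible if the user holds at least one of its ALLOW
    tokens and none of its DENY tokens (hypothetical helper)."""
    user = set(user_tokens)
    return bool(user & set(allow_tokens)) and not (user & set(deny_tokens))


# Tokens the authority service would return for "Peter" in the example.
peter = {"myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"}

# ALLOW tokens attached to the documents at index time in the example.
directory_a_allow = {"myAD:S-123-456-76890", "myAD:S-23-64-12345"}
directory_b_allow = {"myAD:S-123-456-76890"}

if __name__ == "__main__":
    print(is_visible(peter, directory_a_allow))  # Peter is in Group2
    print(is_visible(peter, directory_b_allow))  # only Group1 may read
```

In this example the indexed values are group SIDs prefixed with the authority connection name, and the user's own token list is fetched at query time.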


Thanks,
Peter



On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:
Ok, not hearing back from Peter, I've done some Solr research and written some code that might work.  The approach I've taken is most similar to SOLR 1834, other than the LCF-centric logic.  Hopefully there will be a chance to try this out in a full end-to-end way  on the weekend, after which I will submit it to the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define __ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be string, multivalued fields (I think).  It would be great if these could be made a default part of Solr; similarly, it would be good if the new search component was predelivered with Solr and mentioned (even if commented out) in the example solrconfig.xml file.  The only other thing that needs to be done to hook up the search component is to include a configuration parameter describing the base URL of the LCF authority service.  Plus, as I said earlier, we still don't have a canned solution for authentication yet - although I feel that will be straightforward.
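
A minimal sketch of the lookup side, assuming the URL shape and the "TOKEN:" response lines quoted earlier in this thread (both were described there as "something like" the real thing, so treat them as assumptions). The function names are hypothetical and no network call is made here; the body would come from whatever HTTP client the search component uses.

```python
from urllib.parse import urlencode


def authority_url(base_url: str, username: str) -> str:
    """Build the authority-service request URL for one user.

    base_url is the configured base URL of the LCF authority service,
    e.g. http://localhost:8080/lcf-authority-service (assumed layout).
    """
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(body: str) -> list:
    """Extract access tokens from a response made of 'TOKEN:<value>' lines.

    Lines that do not start with 'TOKEN:' are ignored in this sketch.
    """
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


if __name__ == "__main__":
    print(authority_url("http://localhost:8080/lcf-authority-service",
                        "someusername@somedomain.com"))
    print(parse_tokens("TOKEN:myAD:S-1-1-0\nTOKEN:myAD:S-23-64-12345\n"))
```

The parsed tokens are what would then be matched against the allow and deny fields listed above.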

Comments welcome...
Karl


________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions in detail, and have a couple of SOLR-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just
> authorization.  The reason for this is that it is almost never the
> case that somebody wants to provide multiple credentials in order to
> be able to see their results.  Most enterprises that have multiple
> repositories authenticate against AD and then map AD user names to
> repository user names in order to access those repositories.  If you
> read my earlier posts from this morning, you may have noted that I'm
> looking at recommending JAAS plus Sun's kerb5 login module for
> handling the "authenticate against AD" case, which would cover some
> 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Eric Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
 * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
 * What happens if you need to revoke or change a user's or group's access?
 * It's difficult to move/replicate the index to another domain
 * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). It is not uncommon for user/group access control to change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
>     user, and the passed-in credentials (hash, certificate etc.) are
>     correct
>  2. Query filtering - potentially reduce the number/type of
>     returned results based on the allow/deny metadata for the
>     authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible
> alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems.
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890" and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, PLUS the intersection of his tokens with the
> > document ALLOW tokens, MINUS the intersection of his tokens with the
> > document DENY tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> > >> Active Directory)?
> > >> How does AD and/or LCF handle storing such data in its schema?
> > >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> > >> documents (from wherever).
> > >> These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> > >> So you can imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> > >> wildly different.
> > >> You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> > >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> > >> Sent: Thursday, April 22, 2010 4:27 AM
> > >> To: dev@lucene.apache.org
> > >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> > >> Active Directory)?
> > >> How does AD and/or LCF handle storing such data in its schema?
> > >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> > >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > >> Sent: Tuesday, April 20, 2010 6:51 PM
> > >>
> > >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> > >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> > >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> > >> Solr as I ought to, so it sounds like a potentially productive
> > >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> > >> repositories.  Here's a link that might help with drinking the LCF
> > >> "koolaid", as it were:
> > >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> > >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > >> Sent: Tuesday, April 20, 2010 5:41 PM
> > >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> > >> authentication information.
> > >>
> > >> If you're good with LCF, perhaps we could work together to build
> > >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> > >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> > >> SOLR-1872 looks exactly like what I was envisioning, from the
> > >> search query perspective, although instead of the acl xml file you
> > >> specify, LCF stipulates that you would dynamically query the
> > >> lcf-authority-service servlet for the access tokens themselves.
> > >> That would get you support for AD, Documentum, LiveLink, Meridio,
> > >> and Memex for free. It seems likely that this component could be
> > >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > >> Sent: Tuesday, April 20, 2010 10:44 AM
> > >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> > >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> > >> To: 'dominique.bejean@eolya.fr'
> > >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> > >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> > >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> > >> Sent: Tuesday, April 20, 2010 8:03 AM
> > >> To: Wright Karl (Nokia-S/Cambridge)
> > >> Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> > >> I did some research today and found these:
> > >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> > >> http://demo.findwise.se:8880/SolrSecurity/
> > >>
> > >> The Solr security model has to be able to filter a result list with
> > >> items coming from various sources at the same time (livelink,
> > >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> > >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> > >> authenticate your user against Active Directory, or whatever the
> > >> authenticator ought to be.
> > >> JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> > >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> > >> Sent: Tuesday, April 20, 2010 5:06 AM
> > >> To: connectors-user@incubator.apache.org
> > >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> > >> I don't see in the LCF wiki how Solr and LCF work together at query
> > >> time in order to remove from the result list the items the user is
> > >> not allowed to access.
> > >>
> > >> In
> > >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> > >> I just see these sentences:
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> > >> implemented at this time."
> > >>
> > >> I am not sure I understand. Does this mean that for the moment, it
> > >> is not possible for Solr to apply security by using an Authority
> > >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> -------------------------------------------------------------------
> >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org> For
> >> additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>
> >>
> >
> >
>


RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
>>>>>>
With regards schema extension, I believe we need to be very careful here, as requiring index-time storage of access control data will pose a problem for any use cases where the access control needs to change (maybe often, maybe only occasionally). I'm trying to think of a use case where this wouldn't at least potentially be the case, and I can't think of one, but perhaps I'm not truly understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in (e.g. let's say someone has left my organization, do I have to update documents to remove his/her access?).
<<<<<<

Usually the way this works is that the user's account is locked out so they can't log in.  The authority service picks up this change, and it therefore takes place immediately.
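
To illustrate why a lockout needs no re-indexing, here is a toy sketch. It is not LCF code. The ToyAuthority class, the in-memory index, and the rule "visible if some allow token matches and no deny token matches" are all assumptions for illustration, as is the idea that a locked-out account simply yields no tokens. The token strings are borrowed from the DirectoryA/DirectoryB example elsewhere in this thread.

```python
# Toy model: tokens are attached to documents at index time, but the
# user's token set is fetched at query time. Everything here is a
# hypothetical stand-in, not LCF's actual classes or semantics.

INDEX = [
    {"id": "a1",
     "allow": ["myAD:S-123-456-76890", "myAD:S-23-64-12345"],
     "deny": []},
    {"id": "b1",
     "allow": ["myAD:S-123-456-76890"],
     "deny": []},
]


def is_visible(doc, user_tokens):
    """Assumed rule: at least one allow match and no deny match."""
    tokens = set(user_tokens)
    allowed = bool(tokens & set(doc["allow"]))
    denied = bool(tokens & set(doc["deny"]))
    return allowed and not denied


class ToyAuthority:
    """Stand-in for an authority service that maps users to tokens."""

    def __init__(self, tokens_by_user):
        self._tokens = dict(tokens_by_user)
        self._locked = set()

    def lock_out(self, user):
        # Assumption: a locked-out account yields no tokens at all.
        self._locked.add(user)

    def tokens_for(self, user):
        if user in self._locked:
            return []
        return list(self._tokens.get(user, []))


def search(index, authority, user):
    """Return ids of the documents the user may see right now."""
    tokens = authority.tokens_for(user)
    return [doc["id"] for doc in index if is_visible(doc, tokens)]
```

The index is never touched when the account is locked; only the answer from the authority changes, so the next search reflects the revocation.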

Bear in mind that this particular model has been employed by MetaCarta for more than five years in the field with clients such as pretty near all the major oil companies, many U.S. government agencies, the U.S. military, etc.  In that time we have not heard even one complaint about the security model.

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 7:18 AM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses, although the point at which these are done is slightly different.

I think I am correct in my understanding that with filter queries, the results are filtered 'post-Lucene', but are separately (Solr) cached, so you get a hit on the first search, but then benefit from cached hits on subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are applied at the Lucene query directly, so don't have separate Solr caching. I've not benchmarked the two, so one or other might be slower/faster for various search scenarios.

In any case, I believe either technique can be employed in either 1834 or 1872.


With regards schema extension, I believe we need to be very careful here, as requiring index-time storage of access control data will pose a problem for any use cases where the access control needs to change (maybe often, maybe only occasionally). I'm trying to think of a use case where this wouldn't at least potentially be the case, and I can't think of one, but perhaps I'm not truly understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in (e.g. let's say someone has left my organization, do I have to update documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e. Would the same type/format of tokens be used for data from different sources (e.g. NTFS files, network streams, NFS, web pages, etc.). Will the tokens be compatible with multiple and/or changing authorities (e.g. AD, documentum, LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the acl data. I admit, I've not had enough time to look into how this might look at the moment, but it sounds like it could be a good way to hold generic (authority-agnostic) acl data, and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism, please correct me. Would the __ALLOW_TOKEN__ et al fields store, for example, SID information?


Thanks,
Peter



On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:
Ok, not hearing back from Peter, I've done some Solr research and written some code that might work.  The approach I've taken is most similar to SOLR-1834, other than the LCF-centric logic.  Hopefully there will be a chance to try this out in a full end-to-end way on the weekend, after which I will submit it to the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define __ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be string, multivalued fields (I think).  It would be great if these could be made a default part of Solr; similarly, it would be good if the new search component was predelivered with Solr and mentioned (even if commented out) in the example solrconfig.xml file.  The only other thing that needs to be done to hook up the search component is to include a configuration parameter describing the base URL of the LCF authority service.  Plus, as I said earlier, we still don't have a canned solution for authentication yet - although I feel that will be straightforward.
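
A minimal sketch of what such a search component would have to compute, assuming the authority service answers with plain "TOKEN:<value>" lines as described earlier in this thread. The function names are hypothetical, only the two document-level fields are handled, and the allow/deny semantics (match any allow token, match no deny token) are an assumption rather than the actual component's logic.

```python
def parse_authority_response(body):
    """Extract tokens from 'TOKEN:<value>' lines; ignore other lines."""
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


def quote_term(value):
    """Wrap a token in double quotes so ':' and '-' need no escaping."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_filter_queries(tokens,
                         allow_field="__ALLOW_TOKEN__document",
                         deny_field="__DENY_TOKEN__document"):
    """Build two filter-query strings from a user's tokens.

    Returns None when there are no tokens; the caller should then
    return no results instead of running an unfiltered search.
    """
    if not tokens:
        return None
    any_token = " OR ".join(quote_term(t) for t in tokens)
    return [
        allow_field + ":(" + any_token + ")",       # must hit an allow token
        "-" + deny_field + ":(" + any_token + ")",  # must not hit a deny token
    ]
```

Each returned string would be passed as a separate fq parameter. Note that with this sketch a document indexed without any allow tokens matches nothing, so unsecured documents would need their own convention.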

Comments welcome...
Karl


________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions in detail, and have a couple of SOLR-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just
> authorization.  The reason for this is that it is almost never the
> case that somebody wants to provide multiple credentials in order to
> be able to see their results.  Most enterprises that have multiple
> repositories authenticate against AD and then map AD user names to
> repository user names in order to access those repositories.  If you
> saw my earlier posts from this morning, you may have noted that I'm
> looking at recommending JAAS plus Sun's Krb5 login module for handling
> the "authenticate against AD" case, which would cover some 95%+ of the
> real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Eric Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here's just a few:
 * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
 * What happens if you need to revoke or change a user's or group's access?
 * It's difficult to move/replicate the index to another domain
 * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). There are not uncommon scenarios where user/group access control can change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped ] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>

For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more details. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
>     user, and the passed-in credentials (hash, certificate etc.) are
>     correct
>  2. Query filtering - potentially reduce the number/type of
>     returned results based on the allow/deny metadata for the
>     authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
>    problems
> 2. 'Hard-coding' acl information in the index makes it
>    non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
>    ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com>> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890" and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, PLUS the intersection of his tokens with the
> > document ALLOW tokens, MINUS the intersection of his tokens with the
> > document DENY tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org<ma...@lucene.apache.org>
> > *Cc:* connectors-user@incubator.apache.org<ma...@incubator.apache.org>; lucene-dev@apache.org<ma...@apache.org>;
> > connectors-dev@incubator.apache.org<ma...@incubator.apache.org>
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible
> file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary
> data (e.g.
> > either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com>> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from
> wherever).
> >>  These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So, imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly
> different.
> >>  You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com<ma...@googlemail.com>]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
> >> Cc: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; lucene-dev@apache.org<ma...@apache.org>;
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com><mailto:
> >> karl.wright@nokia.com<ma...@nokia.com>>> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com><mailto:
> >> peter.sturge@googlemail.com<ma...@googlemail.com>>]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >>
> >> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted
> this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com><mailto:
> >> karl.wright@nokia.com<ma...@nokia.com>>> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com><mailto:
> >> peter.sturge@googlemail.com<ma...@googlemail.com>>]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>; dev@lucene.apache.org<ma...@lucene.apache.org><mailto:
> >> dev@lucene.apache.org<ma...@lucene.apache.org>>
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication
> information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this
> in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com><mailto:
> >> karl.wright@nokia.com<ma...@nokia.com>>> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify LCF stipulates you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor
> effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com><mailto:
> >> peter.sturge@googlemail.com<ma...@googlemail.com>>]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org<ma...@lucene.apache.org>>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com><mailto:
> >> karl.wright@nokia.com<ma...@nokia.com>>> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>>'
> >> Cc: 'solr-dev@apache.org<ma...@apache.org>>'; '
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org>>'; '
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be
> done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<ma...@eolya.fr><mailto:
> >> dominique.bejean@eolya.fr<ma...@eolya.fr>>]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>;
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-dev@incubator.apache.org<ma...@incubator.apache.org>>
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whichever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<ma...@eolya.fr>]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org><mailto:
> >> connectors-user@incubator.apache.org<ma...@incubator.apache.org>>
> >>  Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences :
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> -------------------------------------------------------------------
> >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org> For
> >> additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org> For
> additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org





RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
I've attached the code I wrote, but haven't yet tested.  Any comments?
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Wednesday, April 28, 2010 7:18 AM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses, although the point at which these are done is slightly different.

I think I am correct in my understanding that with filter queries, the results are filtered 'post-Lucene', but are separately (Solr) cached, so you get a hit on the first search, but then benefit from cached hits on subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are applied at the Lucene query directly, so don't have separate Solr caching. I've not benchmarked the two, so one or other might be slower/faster for various search scenarios.

In any case, I believe either technique can be employed in either 1834 or 1872.
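
As a rough illustration of the filter-query variant discussed above, the sketch below turns a user's access tokens into two Solr `fq` strings: one requiring a matching allow token, one excluding any matching deny token. The field names are the ones Karl proposes elsewhere in this thread; the function names and the exact clause layout are assumptions made for illustration, and this is not code from SOLR-1834, SOLR-1872, or LCF. It also ignores the share-level fields.

```python
# Field names as proposed in this thread; everything else is hypothetical.
ALLOW_FIELD = "__ALLOW_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def quote_term(token):
    """Wrap a token in double quotes, escaping backslashes and quotes,
    so characters such as ':' and '-' are treated literally."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_security_filters(user_tokens):
    """Return [allow_fq, deny_fq] for the given user tokens."""
    if not user_tokens:
        # Without tokens there is nothing to match on; fail closed.
        raise ValueError("no access tokens: refusing to build an open filter")
    any_token = " OR ".join(quote_term(t) for t in user_tokens)
    allow_fq = f"{ALLOW_FIELD}:({any_token})"   # must match an allow token
    deny_fq = f"-{DENY_FIELD}:({any_token})"    # must not match a deny token
    return [allow_fq, deny_fq]


if __name__ == "__main__":
    for fq in build_security_filters(["myAD:S-1-1-0", "myAD:S-23-64-12345"]):
        print(fq)
```

Each returned string would be passed as a separate `fq` parameter, so that Solr can cache the two filters independently of the user's main query.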


With regards schema extension, I believe we need to be very careful here, as requiring index-time storage of access control data will pose a problem for any use cases where the access control needs to change (maybe often, maybe only occasionally). I'm trying to think of a use case where this wouldn't at least potentially be the case, and I can't think of one, but perhaps I'm not truly understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in (e.g. let's say someone has left my organization, do I have to update documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e. Would the same type/format of tokens be used for data from different sources (e.g. NTFS files, network streams, NFS, web pages, etc.). Will the tokens be compatible with multiple and/or changing authorities (e.g. AD, documentum, LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the acl data. I admit, I've not had enough time to look into how this might look at the moment, but it sounds like it could be a good way to hold generic (authority-agnostic) acl data, and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism, please correct me. Would the __ALLOW_TOKEN__ et al fields store, for example, SID information?
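
For reference, the matching rule Karl described earlier in the thread (a document is returned only if the user holds at least one of its allow tokens and none of its deny tokens) can be sketched as below, using the DirectoryA/DirectoryB values from his Active Directory example. The function name `is_visible` is hypothetical, and this is a plain in-memory illustration rather than how Solr or LCF would evaluate it.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """True if the user shares an allow token with the document
    and shares no deny token with it."""
    user = set(user_tokens)
    return bool(user & set(allow_tokens)) and not (user & set(deny_tokens))


# Tokens from the example in this thread ("myAD" is the authority name).
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]
directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
directory_b = ["myAD:S-123-456-76890"]

if __name__ == "__main__":
    print("DirectoryA visible:", is_visible(peter, directory_a))
    print("DirectoryB visible:", is_visible(peter, directory_b))
```

In that example the indexed values are indeed SIDs prefixed with the authority name, which is the case the question above asks about.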


Thanks,
Peter



On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com>> wrote:
Ok, not hearing back from Peter, I've done some Solr research and written some code that might work.  The approach I've taken is most similar to SOLR 1834, other than the LCF-centric logic.  Hopefully there will be a chance to try this out in a full end-to-end way  on the weekend, after which I will submit it to the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define __ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be string, multivalued fields (I think).  It would be great if these could be made a default part of Solr; similarly, it would be good if the new search component was predelivered with Solr and mentioned (even if commented out) in the example solrconfig.xml file.  The only other thing that needs to be done to hook up the search component is to include a configuration parameter describing the base URL of the LCF authority service.  Plus, as I said earlier, we still don't have a canned solution for authentication yet - although I feel that will be straightforward.
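
A minimal sketch of the authority-service lookup that such a search component would need, assuming the `UserACLs?username=...` URL shape and the `TOKEN:` line format shown earlier in this thread. The function names and the base URL value are hypothetical, and no network call is made here; only URL construction and response parsing are shown.

```python
from urllib.parse import urlencode


def authority_url(base_url, username):
    """Build the UserACLs request URL from the configured base URL."""
    query = urlencode({"username": username})
    return f"{base_url.rstrip('/')}/UserACLs?{query}"


def parse_tokens(body):
    """Extract token values from a response body of 'TOKEN:<value>' lines.
    Lines that do not start with 'TOKEN:' are ignored."""
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


if __name__ == "__main__":
    # Assumed local install, as in the example URL earlier in the thread.
    base = "http://localhost:8080/lcf-authority-service"
    print(authority_url(base, "someusername@somedomain.com"))
    print(parse_tokens("TOKEN:myAD:S-1-1-0\nTOKEN:myAD:S-23-64-12345\n"))
```

Only the `TOKEN:` prefix is stripped, so a token that itself contains a colon (such as an authority-prefixed SID) is kept whole.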

Comments welcome...
Karl


________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions in detail, and have a couple of SOLR-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com>> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
>  The reason for this is because it is almost never the case that
> somebody wants to provide multiple credentials in order to be able to see their results.
>  Most enterprises who have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you noted my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> sun's kerb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Erik Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
 * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
 * What happens if you need to revoke or change a user's or group's access?
 * It's difficult to move/replicate the index to another domain
 * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). It is not uncommon for user/group access control to change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped ] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
>     user, and the passed-in credentials (hash, certificate etc.) are
>     correct
>  2. Query filtering - potentially reduce the number/type of
>     returned results based on the allow/deny metadata for the
>     authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
>    problems
> 2. 'Hard-coding' acl information in the index makes it
>    non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
>    ported/replicated.
>    Storing/retrieving acl information from the source (e.g. NTFS) is
>    problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, RESTRICTED to those whose ALLOW tokens intersect
> > his tokens, MINUS those whose DENY tokens intersect his tokens
> > (there aren't any DENY tokens involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
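
The rule Karl describes can be sketched as a small set check. This is only an illustration in Python, not LCF or Solr code: the `is_visible` function and the document dictionaries are hypothetical names, and the tokens are the ones from the DirectoryA/DirectoryB example in the message.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if a document should appear in the user's results.

    A document is visible when at least one of the user's tokens is in
    the document's ALLOW set and none of them is in its DENY set.
    """
    user = set(user_tokens)
    allowed = bool(user & set(allow_tokens))
    denied = bool(user & set(deny_tokens))
    return allowed and not denied


# Tokens taken from the example in the message.
GROUP1 = "myAD:S-123-456-76890"
GROUP2 = "myAD:S-23-64-12345"

# Hypothetical in-memory stand-ins for indexed documents.
directory_a_doc = {"allow": [GROUP1, GROUP2], "deny": []}
directory_b_doc = {"allow": [GROUP1], "deny": []}

peter_tokens = ["myAD:S-1-1-0", "myAD:S-323-999-12345", GROUP2]

visible_a = is_visible(peter_tokens, directory_a_doc["allow"], directory_a_doc["deny"])
visible_b = is_visible(peter_tokens, directory_b_doc["allow"], directory_b_doc["deny"])
```

With these inputs, Peter can see the DirectoryA document (through Group2) but not the DirectoryB document, which only Group1 may read.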
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (Does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).
> >>  These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you can imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.
> >>  You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (Does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >>
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
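
A client of the authority service would need to build that URL and read the `TOKEN:` lines back. The sketch below shows one way this might look in Python; it does no network access. The `UserACLs` path, the `username` parameter, and the `TOKEN:` prefix come from the message, while `authority_url` and `parse_tokens` are hypothetical helper names, and ignoring every line that does not start with `TOKEN:` is an assumption, since the message elides the rest of the response.

```python
from urllib.parse import urlencode


def authority_url(base_url, username):
    """Build the UserACLs lookup URL for a user (the username is URL-encoded)."""
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract access tokens from a response made of 'TOKEN:<value>' lines.

    Lines without the TOKEN: prefix are skipped (an assumption of this sketch).
    """
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


url = authority_url(
    "http://localhost:8080/lcf-authority-service",
    "someusername@somedomain.com",
)
tokens = parse_tokens("TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n....\n")
```

The returned list is what a search component would then compare against the allow and deny metadata stored with each document.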
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org';
> >> 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and I found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter the result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences :
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org>
For additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org>
For additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>



Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the
middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses,
although the point at which these are applied is slightly different.

I think I am correct in my understanding that with filter queries, the
results are filtered 'post-Lucene', but are separately (Solr) cached, so you
get a hit on the first search, but then benefit from cached hits on
subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are
applied at the Lucene query directly, so don't have separate Solr caching.
I've not benchmarked the two, so one or other might be slower/faster for
various search scenarios.

In any case, I believe either technique can be employed in either 1834 or
1872.
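
As a rough illustration of the two approaches, here is a minimal Python sketch that attaches the same access-token restriction to a Solr request in both ways: once as a separate `fq` parameter, and once folded into the main `q` as required/prohibited clauses. The field names follow the __ALLOW_TOKEN__document / __DENY_TOKEN__document convention used in this thread; the helper names and example tokens are hypothetical, and this is not claimed to be how 1834 or 1872 actually build their queries.

```python
from urllib.parse import urlencode

# Field names as discussed in this thread; everything else is illustrative.
ALLOW_FIELD = "__ALLOW_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def quote(token):
    """Quote a token so characters such as ':' and '-' are taken literally."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def acl_clause(tokens):
    """Require at least one allow-token match and forbid any deny-token match."""
    if not tokens:
        raise ValueError("at least one access token is required")
    values = " OR ".join(quote(t) for t in tokens)
    return "+{a}:({v}) -{d}:({v})".format(a=ALLOW_FIELD, d=DENY_FIELD, v=values)


def as_filter_query(user_query, tokens):
    """Variant 1: leave the user query alone, pass the restriction as fq."""
    return urlencode([("q", user_query), ("fq", acl_clause(tokens))])


def as_boolean_clauses(user_query, tokens):
    """Variant 2: fold the restriction into the main query itself."""
    combined = "+({}) {}".format(user_query, acl_clause(tokens))
    return urlencode([("q", combined)])


if __name__ == "__main__":
    tokens = ["myAD:S-1-1-0", "myAD:S-23-64-12345"]
    print(as_filter_query("title:report", tokens))
    print(as_boolean_clauses("title:report", tokens))
```

Both variants select the same documents; the difference is only where the restriction travels in the request, which is what determines whether Solr can cache it separately.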


With regard to schema extension, I believe we need to be very careful here, as
requiring index-time storage of access control data will pose a problem for
any use cases where the access control needs to change (maybe often, maybe
only occasionally). I'm trying to think of a use case where this wouldn't at
least potentially be the case, and I can't think of one, but perhaps I'm not
truly understanding what exactly is stored in the __ALLOW_TOKEN__ and
__DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in
(e.g. let's say someone has left my organization, do I have to update
documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e.
would the same type/format of tokens be used for data from different sources
(e.g. NTFS files, network streams, NFS, web pages, etc.)? Will the tokens be
compatible with multiple and/or changing authorities (e.g. AD, documentum,
LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the acl data. I admit, I've not had
enough time to look into how this might look at the moment, but it sounds
like it could be a good way to hold generic (authority-agnostic) acl data,
and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism,
please correct me. Would the __ALLOW_TOKEN__ et al fields store, for
example, SID information?


Thanks,
Peter



On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:

> Ok, not hearing back from Peter, I've done some Solr research and written
> some code that might work.  The approach I've taken is most similar to SOLR
> 1834, other than the LCF-centric logic.  Hopefully there will be a chance to
> try this out in a full end-to-end way  on the weekend, after which I will
> submit it to the Solr team (where I think it most naturally would be built
> and delivered).
>
> What it's going to need is either a static or dynamic schema addition to
> define __ALLOW_TOKEN__document, __DENY_TOKEN__document,
> __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be
> string, multivalued fields (I think).  It would be great if these could be
> made a default part of Solr; similarly, it would be good if the new search
> component was predelivered with Solr and mentioned (even if commented out)
> in the example solrconfig.xml file.  The only other thing that needs to be
> done to hook up the search component is to include a configuration parameter
> describing the base URL of the LCF authority service.  Plus, as I said
> earlier, we still don't have a canned solution for authentication yet -
> although I feel that will be straightforward.
>
> Comments welcome...
> Karl
>
>
> ________________________________________
> From: Wright Karl (Nokia-S/Cambridge)
> Sent: Tuesday, April 27, 2010 8:20 AM
> To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
> Subject: RE: FW: Solr and LCF security at query time
>
> Hi Peter,
>
> I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions
> in detail, and have a couple of SOLR-related questions.
>
> Both contributions rely on a SearchComponent to work their magic.  However,
> it also appears that each modifies the user query in a different way.  1834
> uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND
> and OR filterquery clauses.  Both of them are constructed using Solr
> FilterQuery objects.  Here are my questions:
>
> (1) I am not conversant enough with Solr yet to know the difference between
> the different kinds of clause structure.  Do you know if there is a
> difference?  For example, is there any possibility that AND/OR clauses can
> permit documents to be seen that should not be seen?  (MUST and MUST_NOT
> sound a lot more definite...)
>
> (2) Are Solr FilterQuery objects applied to constructing the query that
> will be sent to Lucene?  Or are they applied by Solr after-the-fact to the
> resultset?  Or, is it a combination of the two, depending on the details of
> your actual filter clause?
>
> I also haven't heard much from you in the last week or so - have you
> thought further about what you intend to do, and can you let me know whether
> you are still interested in developing an LCF plugin for Solr?
>
> Thanks,
> Karl
>
> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 12:23 PM
> To: dev@lucene.apache.org
> Cc: connectors-dev@incubator.apache.org;
> connectors-user@incubator.apache.org; lucene-dev@apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> See inline...
>
> On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:
>
> > Hi Peter,
> >
> > The authority connectors don't perform authentication at this time.
> > In fact, LCF has nothing to do with authentication at all - just
> > authorization.  The reason for this is that it is almost never the
> > case that somebody wants to provide multiple credentials in order to
> > be able to see their results.  Most enterprises that have multiple
> > repositories authenticate against AD and then map AD user names to
> > repository user names in order to access those repositories.  If you
> > read my earlier posts from this morning, you may have noted that I'm
> > looking at recommending JAAS plus Sun's Krb5 login module for handling
> > the "authenticate against AD" case, which would cover some 95%+ of the
> > real-world authentication needed out there.
> >
> >
> I did read your earlier post regarding this, and I totally agree with you -
> this is best handled 'upstream'. In fact, I use a JAAS plugin in other
> places in the product (not Solr) for authentication.
>
>
> >
> > Yes, the idea is to store SIDs in solr at index time.  I don't know
> > enough about solr to know what kinds of issues this might entail, but
> > Lucene certainly has a model of metadata that's pretty flexible, so I
> > don't think this would be difficult at all.  Erik Hatcher also seemed
> > to confirm my suspicions that this would not be a problem.
> >
>
> It's certainly not a problem to store this data in Solr. The problem is
> more that you don't really *want* to store this data at index time.
> There are lots of reasons for not wanting to 'hard-code' SID data with
> documents in the index. Here are just a few:
>  * What happens if/when you want to add explicit user access to some [group
> of] documents ? (i.e. not via a group)
>  * What happens if you need to revoke or change a user's or group's access?
>  * It's difficult to move/replicate the index to another domain
>  * For AD, SIDs are generally not meant to be stored long term outside of
> AD, as they can be changed (this doesn't happen often, but it can happen
> after an AD rebuild, domain type upgrade, data recovery etc.)
>
> These and other scenarios mean re-indexing the stored data. When the index
> is huge, this is non-trivial (time-wise). There are not uncommon scenarios
> where user/group access control can change multiple times in one day.
>
> There might be a way of storing acl data in a payload or similar, but I'm
> not sure how that would work across millions of [arbitrarily grouped]
> documents (I'm not familiar enough with payloads to know if this would be a
> good or bad idea).
>
>
> >
> > This is exactly why I think that we need to do the authentication
> > upstream of the authority world.
> >
> >
> Agreed.
>
>
>
> >
> > If Solr handles arbitrary document metadata, then I think we could
> > just use that feature.  But you know more about it than me, at this
> > point.  It would be great to get an overview of potential ways of
> > doing this.
> >
> >
> Payloads, maybe?
>
>
> >
> > For your particular task, it sounds like you are trying to read from
> > NTFS and apply security after-the-fact with some acl specification
> > file.  In that case, I'd write a repository connector that was based
> > on the file system connector (already part of the stable of connectors
> > for LCF) which reads ACL information from your acl.xml file.  Or, if
> > you prefer a UI for specifying ACL information, you could extend the
> > connector so that security is configured in the UI without having an
> > external acl.xml file at all - which would be a nice addition to the
> > existing file system connector.  (Repository connections and jobs are
> > configured internally in LCF by XML documents stored in the database,
> > so they can be arbitrarily structured.  I'm happy to help you figure
> > out how to do this if this is what you decide to do.)
> >
> For my particular requirements, there are no files - the data is generated
> from the network and stored. After the fact, there is no persistent
> location of this data other than in Solr.
>
> Storing the acl info using the connector sounds very interesting. Could be
> worth looking at in more detail. Thanks!
>
>
>
> > I think we still need to add in the authentication piece to make this
> > all work for you, so perhaps you can describe how you expect a user to
> > interact with your system, so I can understand your design issues.
> >
> > Thanks,
> > Karl
> >
> >
>
>
>
>
>
>
>
>
> > -----Original Message-----
> > From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > Sent: Thursday, April 22, 2010 11:32 AM
> > To: dev@lucene.apache.org
> > Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > Subject: Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for your detailed explanation - really good!
> >
> > As I've thought through some of the implications, I've added comments
> > below, so I hope they don't seem too jumbled...
> >
> > I suppose on the 'authority' side, it works kind of as I envisioned it
> > would.
> >
> > For general Solr access control, there are two layers of security that
> > need to be addressed:
> >  1. Authentication - make sure the incoming query is from a valid
> >     user, and the passed-in credentials (hash, certificate etc.) are
> >     correct
> >  2. Query filtering - potentially reduce the number/type of
> >     returned results based on the allow/deny metadata for the
> >     authenticated user
> >
> > I can see how the LCF auth connector works for 2., but can it do 1. as
> > well?
> > It would be good if this could somehow be integrated into any
> > container (Tomcat/Jetty et al) authentication that might be configured
> > (probably related to your previous post). In many ways, it could/should
> > be that the Authority (AD) part of the connector should only be
> > concerned with 1. and not 2. (see below).
> >
> > So, on the repository side, there is also an LCF connector that
> > 'closes the loop' to provide the 'what is it I'm trying to control'
> > side of things.
> > I understand that LCF doesn't do the mapping - it delegates this task
> > to the caller, but provides both sides of the equation (authority &
> > repository).
> >
> > >>>>>
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890",
> > and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> > <<<<<
> > I think this is the bit that is worrying me - is this storing the SIDs
> > into Solr at document index time? This would be a problem for a whole
> > load of reasons, but maybe I'm missing something here? (see below for
> > a possible alternative)
> >
> > Basically, what I'm getting at here is that the allow/deny values need
> > to be stored in one of three places:
> >  1. In the authority (e.g. inside AD)
> >  2. In the document metadata (index-time)
> >  3. In external storage (e.g. acl.xml, NTFS etc.)
> >
> > 1. Extending AD is pretty much out, as this causes too many interop
> >    problems.
> > 2. 'Hard-coding' acl information in the index makes it
> >    non-portable, resistant to changes, etc.
> > 3. acl.xml is coupled with a Solr instance, but is easily
> >    ported/replicated.
> >    Storing/retrieving acl information from the source (e.g. NTFS) is
> >    problematic, as the source may not be accessible (it may not even exist).
> >
> > I believe 3. or a variant is the way to go on the repo side, which
> > means the LCF Authority connector is mainly for Authentication (see
> > above), which is what you want from AD et al integration.
> > The problem that arises from 'pluggable' authentication is that, if
> > you're not using a certificate, you have to start with a password, but
> > the connector only has access to the password hash (unless the pwd is
> > sent in the query url). I don't know of a way to confirm identities in
> > AD using only the username and hash (AD does the hash compare). I
> > believe this is where container-based integration will likely work
> > better.
> >
> > So that I can confirm my understanding...a scenario might be like this:
> >
> > We have an AD connector that fetches the SIDs and we can read them etc.
> > For my environment, where there are no 'files' (there's only a
> > transient network stream), we have an LCF 'Solr Field Filter Query'
> > connector that decides which Filter Queries to apply (allow and deny)
> > for the passed in SID(s).
> >
> > For another environment, let's say, NTFS, there might be an 'NTFS'
> > connector that would provide some kind of mapping of files/folders to
> > SID(s). Since Solr wouldn't intrinsically know about this, the acl
> > information would need to be stored somewhere in the index. This would
> > mean extending the Solr schema and storing metadata at index time.
> > The alternative is to re-use the 'Solr Field Filter Query' connector
> > for this as well (and any other document types that might be read in).
> > This keeps the index 'clean' of acl-specific metadata, and allows for
> > in-place changes and easy cross-document/index/instance access control.
> >
> >
> > If the above interpretation is [roughly] correct (please let me know
> > if I've got this wrong!), this would reduce down to having:
> >   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> > (possibly/partly at the container level)
> >   2. At least an LCF Repository connector for 'acl.xml'
> >   3. Optional other LCF Repository connectors
> >
> > It sounds like you've now finished the first half of 1. by adding the
> > ability to get the required auth data from a Solr api call. The other
> > half of 1. will be implementing the LCF interface in the
> > SolrACLSecurity class, to effectively replace the 'user', 'group' and
> > 'password' bits of acl.xml.
> >
> > Does the above sound like an accurate interpretation? Just trying to
> > get a good picture of what work needs doing, where it goes, etc.
> >
> > Many thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
> >
> > >  >>>>>>
> > > What is the relationship between stored data (documents) and
> > > authorities' access/deny attributes? (do you have any examples of
> > > what an access_token value might contain?)
> > > <<<<<<
> > >
> > > Documents have access/deny attributes; authorities simply provide
> > > the list of tokens that belong to an authenticated user.  Thus,
> > > there's no access/deny for an authority; that's attached to the
> > > document (as it is in real-world repositories).
> > >
> > > Let's run a quick example, using Active Directory and a Windows file
> > > system.  Suppose that you have a directory with documents in it,
> > > call it DirectoryA, and the directory allows read access to the
> > > following SIDs:
> > >
> > > S-123-456-76890
> > > S-23-64-12345
> > >
> > > These SIDs correspond to active directory groups, let's call them
> > > Group1 and Group2, respectively.
> > >
> > > DirectoryB also has documents in it, and those documents have just
> > > the SID S-123-456-76890 attached, because only Group1 can read its
> > > contents.
> > >
> > > Now, pretend that someone has created an LCF Active Directory
> > > authority connection (in the LCF UI), which is called "myAD", and
> > > this connection is set up to talk to the governing AD domain
> > > controller for this Windows file system.  We now know enough to
> > > describe the document indexing process:
> > >
> > > - Each file in DirectoryA will have the following
> > > __ALLOW_TOKEN__document attributes inside Solr:
> > > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > > - Each file in DirectoryB will have the following
> > > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> > >
> > > Now, suppose that a user (let's call him "Peter") is authenticated
> > > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > > are (say):
> > >
> > > S-1-1-0 (the 'everyone' SID)
> > > S-323-999-12345 (his own personal user SID)
> > > S-23-64-12345 (the SID he gets because he belongs to group 2)
> > >
> > > We want to look up the documents in the search index that he can see.
> > > So, we ask the LCF authority service what his tokens are, and we get
> > > back:
> > >
> > > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> > >
> > > The documents we should return in his search are the ones matching
> > > his search criteria that also have at least one ALLOW token in
> > > common with his tokens and no DENY token in common with his tokens
> > > (there aren't any DENY tokens involved in this example).  So only
> > > files that have one of his three tokens as an ALLOW attribute would
> > > be returned.
> > >
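
The allow/deny rule in this example can be written down as a small set operation. The following Python sketch uses the SIDs and the "myAD" authority name from the example above; the function name `visible` is made up for illustration, and this is only a model of the rule, not code from LCF or Solr.

```python
def visible(user_tokens, allow_tokens, deny_tokens=()):
    """A document is visible when the user holds at least one of its ALLOW
    tokens and none of its DENY tokens."""
    user = set(user_tokens)
    has_allow = bool(user & set(allow_tokens))
    has_deny = bool(user & set(deny_tokens))
    return has_allow and not has_deny


# Tokens the authority service returns for Peter in the example above.
peter = {"myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"}

# ALLOW tokens attached to the files in each directory at index time.
directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
directory_b = ["myAD:S-123-456-76890"]

if __name__ == "__main__":
    print("DirectoryA visible:", visible(peter, directory_a))
    print("DirectoryB visible:", visible(peter, directory_b))
```

Peter sees the DirectoryA files through his Group2 token and sees nothing in DirectoryB, because he holds no Group1 token.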
> > > Note that what we are attempting to do is enforce AD's security with
> > > the search results we present.  There is no need to define a whole
> > > new security mechanism, because AD already has one that people use.
> > >
> > > >>>>>>
> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > > to ensure there are no security or other dependencies of indexed
> > > data with any external repository - most notably the file system.
> > > There are many reasons for wanting this, but one of the main ones is
> > > that Solr-stored data is not always based on file data (or
> > > accessible file data).
> > > In fact, in my particular case, almost none of the indexed data
> > > comes from files.
> > > <<<<<<
> > >
> > > LCF is all about abstracting from repositories.  It's not
> > > specifically about a file system, although that is a convenient
> > > example.  If you are building your own kind of repository with your
> > > own security setup, that's fine - but in the LCF world you'd need to
> > > create an authority connector for your repository (which maybe reads
> > > your acl.xml file), as well as a repository connector (which hands
> > > documents to LCF and provides it with the access tokens that make
> > > security work).  Of course, you can do something much lighter that
> > > doesn't include LCF at all if you are just integrating a custom
> > > repository of your own, but it sounded like you were interested in
> > > the broader problem here.
> > >
> > > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > > connectors to work cooperatively to define access tokens in a way
> > > that is consistent from authority connector to repository connector
> > > for a given repository kind.  Anybody can write a connector, so the
> > > beauty of all this is that you can build a system where data from
> > > many disparate sources is indexed, and security for each is
> > > simultaneously enforced.
> > >
> > > Karl
> > >
> > >
> > >  ------------------------------
> > > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > > *Sent:* Thursday, April 22, 2010 9:24 AM
> > >
> > > *To:* dev@lucene.apache.org
> > > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > > connectors-dev@incubator.apache.org
> > > *Subject:* Re: FW: Solr and LCF security at query time
> > >
> > > Hi Karl,
> > >
> > > Thanks very much for the diagram -
> > > Sorry about all the questions, but this raises a few new ones...
> > >
> > > What is the relationship between stored data (documents) and
> > > authorities' access/deny attributes? (do you have any examples of
> > > what an access_token value might contain?)
> > >
> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > > to ensure there are no security or other dependencies of indexed
> > > data with any external repository - most notably the file system.
> > > There are many reasons for wanting this, but one of the main ones is
> > > that Solr-stored data is not always based on file data (or
> > > accessible file data).
> > > In fact, in my particular case, almost none of the indexed data
> > > comes from files.
> > >
> > > This is one reason why SOLR-1872 uses filter queries for its
> > > access/deny tokens - so that all the required information for access
> > > control completely resides within the Solr index itself.
> > > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > > and users, some external 'repository' (files) and users, or
> > > arbitrary data (e.g. either of these)?
> > >
> > > I hope that makes sense...
> > >
> > > Thanks!
> > > Peter
> > >
> > >
> > >
> > >
> > > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> > >
> > >> Hi Peter,
> > >>
> > >> I've attached a diagram that is not in the wiki as of yet, and I'll
> > >> try to answer your questions.
> > >>
> > >> >>>>>>
> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> > >> stored for a particular user in the underlying acl store (e.g.
> > >> Active Directory)?
> > >> How does AD and/or LCF handle storing such data in its schema?
> > >> (does AD need its schema extended?) Presumably, any such AD fields
> > >> would need to be queried for effective rights in order to cater for
> > >> group membership allows and denies.
> > >> <<<<<<
> > >>
> > >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> > >> strings that represent a contract between an LCF authority
> > >> connection and the LCF repository connection that picks up the
> > >> documents (from wherever).  These tokens thus have no real meaning
> > >> outside of LCF.  You must regard them as opaque.
> > >>
> > >> The contract, however, states that if you use the LCF authority
> > >> service to obtain tokens for an authenticated user, you will get
> > >> back a set that is CONSISTENT with the tokens that were attached to
> > >> the documents LCF sent to Solr for indexing in the first place.
> > >> So, you don't have to worry about it, and that's kind of the idea.
> > >> So you can imagine the following flow:
> > >>
> > >> (1) Use LCF to fetch documents and send them to Solr
> > >> (2) When searching, use the LCF authority service to get the
> > >> desired user's access tokens
> > >> (3) Either filter the results, or modify the query, to be sure the
> > >> access tokens all match up properly
> > >>
> > >> For the AD authority, the LCF access tokens consist, in part, of
> > >> the user's SIDs.  For other authorities, the access tokens are
> > >> wildly different.  You really don't want to know what's in them,
> > >> since that's the job of the LCF authority to determine. ;-)
> > >>
> > >> LCF is not, by the way, joined at the hip with AD.  However, in
> > >> practice, most enterprises in the world use some form of AD single
> > >> signon for their web applications, and even if they're using some
> > >> repository with its own idea of security, there's a mapping between
> > >> the AD users and the repository's users.  Doing that mapping is
> > >> also the job of the LCF authority for that repository.
> > >>
> > >> Hope this helps.  Also, I'm not expecting time miracles here, so
> > >> don't sweat the schedule.
> > >>
> > >>
> > >> Karl
> > >>
> > >>
> > >> ________________________________________
> > >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> > >> Sent: Thursday, April 22, 2010 4:27 AM
> > >> To: dev@lucene.apache.org
> > >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > >> connectors-dev@incubator.apache.org
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> Hi Karl,
> > >>
> > >> Thanks for the quick turnaround.
> > >> I'm in the middle of a product release for us, so I fear I won't be
> > >> as quick as you... :-)
> > >>
> > >> I couldn't find a simple flow diagram or similar for LCF with
> > >> regards security (probably looking in the wrong place).
> > >> Perhaps you could help on these questions...?
> > >>
> > >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> > >> sub-queries, which are then used as filter queries in a user's search.
> > >>
> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> > >> stored for a particular user in the underlying acl store (e.g.
> > >> Active Directory)?
> > >> How does AD and/or LCF handle storing such data in its schema?
> > >> (does AD need its schema extended?) Presumably, any such AD fields
> > >> would need to be queried for effective rights in order to cater for
> > >> group membership allows and denies.
> > >>
> > >> I guess I'm just trying to understand the architectural
> > >> flow/storage/retrieval of data in the various parts of the system,
> > >> but I admit, I need to do more research on this.
> > >> After our product release, when I get a few more spare cycles, I
> > >> can look at it in more detail.
> > >>
> > >> Many thanks!
> > >> Peter
> > >>
> > >>
> > >>
> > >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> > >> Hi Peter,
> > >>
> > >> I just committed the promised changes to the LCF Solr output
> > >> connector.
> > >>
> > >> ACL metadata will now be posted to the Solr Http interface along
> > >> with the document as the two following fields:
> > >>
> > >> __ACCESS_TOKEN__document
> > >> __DENY_TOKEN__document
> > >>
> > >> There will, of course, potentially be multiple values for each of
> > >> these two fields.
> > >>
> > >> Hope this helps,
> > >> Karl
> > >>
> > >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > >> Sent: Tuesday, April 20, 2010 6:51 PM
> > >> To: connectors-user@incubator.apache.org
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> Hi Karl,
> > >>
> > >> Thanks for the info. I'll have a look at the link and try to take
> > >> in as much sugar as my insulin levels will handle...
> > >> It sounds like the necessary interface(s) are already in LCF - just
> > >> a matter of implementing them in the Solr 1872 plugin.
> > >> I'll need to digest the LCF stuff to get to grips with it... please
> > >> bear with me while I do that...
> > >>
> > >> When you say:
> > >>   The LCF solr output connection doesn't yet do this, but it is
> > >> trivial for me to make that happen.
> > >> Do you mean a mechanism by which solr.war can get url et al info
> > >> from its parent container (Tomcat, Jetty etc.), or have I
> > >> misinterpreted this?
> > >>
> > >>
> > >> Thanks,
> > >> Peter
> > >>
> > >>
> > >>
> > >>
> > >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> > >> Hi Peter,
> > >>
> > >> I'm the principal committer for LCF, but I don't know as much about
> > >> Solr as I ought to, so it sounds like a potentially productive
> > >> collaboration.
> > >>
> > >> LCF does exactly what you are looking for - the only issue at all
> > >> is that you need to fetch a URL from a webapp to get what you are
> > >> looking for.  The "plugs" are all inside LCF for different kinds of
> > >> repositories.  Here's a link that might help with drinking the LCF
> > >> "koolaid", as it were:
> > >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> > >>
> > >> The url would be something like this (on a locally installed
> > >> tomcat-based LCF instance):
> > >>
> > >>
> > >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> > >>
> > >> ... and this fetch returns something like:
> > >>
> > >> TOKEN:xxxxxxx
> > >> TOKEN:yyyyyyy
> > >> TOKEN:zzzzzzz
> > >> ....
> > >>
> > >> ... which represent the amalgamated tokens for all of the defined
> > >> authorities, and by some strange coincidence ( ;-) ) are compatible
> > >> with certain pieces of metadata that have been passed into Solr
> > >> with each document - one set of Allow tokens, and a second set of
> > >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> > >> but it is trivial for me to make that happen.
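
A caller of the authority service might fetch and parse that response roughly as follows. This is a minimal Python sketch assuming only what is shown above: a `UserACLs` endpoint taking a `username` parameter and returning one `TOKEN:` line per access token. The function names, the UTF-8 decoding, and the choice to ignore any other lines are assumptions made for illustration, not documented LCF behaviour.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

TOKEN_PREFIX = "TOKEN:"


def user_acls_url(base_url, username):
    """Build the UserACLs request URL, URL-encoding the user name."""
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(body):
    """Collect the value of every 'TOKEN:' line; other lines are skipped
    (an assumption, since only TOKEN lines are shown in this thread)."""
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(TOKEN_PREFIX):
            tokens.append(line[len(TOKEN_PREFIX):])
    return tokens


def fetch_tokens(base_url, username, timeout=10):
    """Ask the authority service for a user's tokens (needs a running service)."""
    with urlopen(user_acls_url(base_url, username), timeout=timeout) as resp:
        # UTF-8 is assumed here; the thread does not state the encoding.
        return parse_tokens(resp.read().decode("utf-8"))


if __name__ == "__main__":
    sample = "TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"
    print(parse_tokens(sample))
    print(user_acls_url("http://localhost:8080/lcf-authority-service",
                        "someusername@somedomain.com"))
```

The returned tokens are treated as opaque strings and would be compared as-is against the Allow and Deny metadata stored with each document.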
> > >>
> > >> Does this sound plausible to you?
> > >>
> > >> Karl
> > >>
> > >>
> > >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > >> Sent: Tuesday, April 20, 2010 5:41 PM
> > >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> Hi Karl,
> > >>
> > >> Integrating LCF to get external token support for SOLR-1872 sounds
> > >> very interesting indeed. I don't know anything about LCF, but one
> > >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> > >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> > >> series of plugins that could be used for obtaining back-end
> > >> authentication information.
> > >>
> > >> If you're good with LCF, perhaps we could work together to build
> > >> this in.
> > >> One of the first things would be defining an interface that would
> > >> be as easy as possible to plug LCF into. Have you any
> > >> suggestions/insight on this front?
> > >>
> > >> Many thanks,
> > >> Peter
> > >>
> > >>
> > >>
> > >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> > >> SOLR-1872 looks exactly like what I was envisioning, from the
> > >> search query perspective, although instead of the acl xml file you
> > >> specify, LCF stipulates you would dynamically query the
> > >> lcf-authority-service servlet for the access tokens themselves.
> > >> That would get you support for AD, Documentum, LiveLink, Meridio,
> > >> and Memex for free. It seems likely that this component could be
> > >> modified to work with LCF with minor effort.
> > >>
> > >> The missing component still seems to be AD authentication, which
> > >> needs a solution.
> > >>
> > >> Karl
> > >>
> > >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> > >> peter.sturge@googlemail.com>]
> > >> Sent: Tuesday, April 20, 2010 10:44 AM
> > >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> If you want to do this completely within Solr, have a look at:
> > >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> > >>
> > >> Thanks,
> > >> Peter
> > >>
> > >>
> > >>
> > >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
> > >> karl.wright@nokia.com>> wrote:
> > >> FYI
> > >>
> > >> ________________________________
> > >> From: Wright Karl (Nokia-S/Cambridge)
> > >> Sent: Tuesday, April 20, 2010 8:16 AM
> > >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
> > >> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
> > >> connectors-dev@incubator.apache.org<mailto:
> > >> connectors-dev@incubator.apache.org>'; '
> > >> connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>'
> > >> Subject: RE: Solr and LCF security at query time
> > >>
> > >> Dominique,
> > >>
> > >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> > >> establishes a powerful multi-repository security model, even though
> > >> it doesn't yet do the final step of enforcing that model at the
> > >> search end.  LCF allows you to define multiple authorities to
> > >> operate against disparate repositories, and use the appropriate
> > >> authority to secure any given document.  The solr people are aware
> > >> of this design, which addresses the issues raised by SOLR-1834 very
> > >> nicely.  However, as I said before, time is a problem, and the work
> > >> still needs to be done.
> > >>
> > >> I suggest you read up on the actual security model of LCF, and
> > >> perhaps experiment with that and the SOLR-1834 contribution, to see
> > >> if there is common ground.  One thing we've learned at MetaCarta is
> > >> that post-filtering for security purposes is expensive, and it is
> > >> better to modify the queries themselves to restrict the results, if
> > >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> > >> sounds like it might be the filtering approach.  Still, it would be
> > >> better than nothing.
> > >>
> > >> Please let me know what you find out.
> > >>
> > >> Thanks,
> > >> Karl
> > >>
> > >> ________________________________
> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
> > >> dominique.bejean@eolya.fr>]
> > >> Sent: Tuesday, April 20, 2010 8:03 AM
> > >> To: Wright Karl (Nokia-S/Cambridge)
> > >> Cc: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>;
> > >> connectors-dev@incubator.apache.org<mailto:
> > >> connectors-dev@incubator.apache.org>
> > >> Subject: Re: Solr and LCF security at query time
> > >>
> > >> Karl,
> > >>
> > >> Thank you for your reply.
> > >>
> > >> I did some research today and found this:
> > >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> > >> http://demo.findwise.se:8880/SolrSecurity/
> > >>
> > >> The Solr security model has to be able to filter a result list with
> > >> items coming from various sources at the same time (LiveLink,
> > >> Documentum, file system, ...). Big subject :)
> > >>
> > >> Dominique
> > >>
> > >>
> > >> On 20/04/10 13:34,
> > >> karl.wright@nokia.com<ma...@nokia.com> wrote:
> > >> Hi Dominique,
> > >>
> > >> At the moment, in order to enforce the LCF security model within
> > >> Lucene/Solr, you will need to build this functionality into
> > >> whatever client you are using to display the Lucene search results.
> > >> Specifically, you would need to take the following steps:
> > >>
> > >> (1) Have your users access your search client through Apache.
> > >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> > >> mod_authz_annotate, to cause authorization HTTP headers to be
> > >> transmitted to the client webapp.
> > >> (3) Have your client webapp alter whatever queries it is doing, to
> > >> add an appropriate query clause for each of the access tokens
> > >> transmitted in the headers.
> > >>
> > >> (This is how it is done at MetaCarta.)
> > >>
> > >> Alternatively, you may find a way to do this completely with a web
> > >> application under a Java app server such as Tomcat.  I have not yet
> > >> done the research to find out whether this is a feasible alternative.
> > >> Effectively, what you need something like mod_auth_kerb to do is to
> > >> authenticate your user against Active Directory, or whatever the
> > >> authenticator ought to be.  JAAS may be helpful here.
> > >>
> > >> There are, of course, intentions to fill out the missing pieces
> > >> more completely and transparently via a Solr search plugin and/or
> > >> filter.  What has been lacking is time.  If you are in a position to
> > >> do development in this area, we're happy to have any assistance you
> > >> might provide.
> > >>
> > >> Thanks,
> > >> Karl
> > >> ________________________________
> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> > >> Sent: Tuesday, April 20, 2010 5:06 AM
> > >> To: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>
> > >>  Subject: Solr and LCF security at query time
> > >>
> > >> Hi,
> > >>
> > >> I don't see in the LCF wiki how Solr and LCF work together at query
> > >> time in order to remove from the result list the items the user is
> > >> not allowed to access.
> > >>
> > >> In
> > >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> > >> I just see these sentences:
> > >>
> > >> " Once all these documents and their access tokens are handed to
> > >> the search engine, it is the search engine's job to enforce
> > >> security by excluding inappropriate documents from the search
> > >> results. For Lucene, this infrastructure is expected to be built on
> > >> top of Lucene's generic metadata abilities, but has not been
> > >> implemented at this time."
> > >>
> > >> I am not sure I understand. Does this mean that for the moment, it
> > >> is not possible for Solr to apply security by using an Authority
> > >> Connector?
> > >>
> > >> Dominique
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -------------------------------------------------------------------
> > >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
> > >> additional commands, e-mail: dev-help@lucene.apache.org
> > >>
> > >
> > >

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the
middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses,
although the point at which these are applied is slightly different.

I think I am correct in my understanding that with filter queries, the
results are filtered 'post-Lucene', but are separately (Solr) cached, so you
get a hit on the first search, but then benefit from cached hits on
subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are
applied at the Lucene query directly, so don't have separate Solr caching.
I've not benchmarked the two, so one or other might be slower/faster for
various search scenarios.

In any case, I believe either technique can be employed in either 1834 or
1872.
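
As a rough illustration of the query-modification route (not code taken from
either patch), the sketch below turns a user's access tokens into a single
filter string. The field names are the document-level ones proposed elsewhere
in this thread; the function names, the quoting helper, and the choice to
ignore the share-level fields are assumptions made purely for this example.

```python
def quote_token(token):
    """Wrap a token in double quotes, escaping backslashes and quotes,
    so that characters such as ':' are not parsed as query syntax."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_access_filter(tokens,
                        allow_field="__ALLOW_TOKEN__document",
                        deny_field="__DENY_TOKEN__document"):
    """Build a filter string: at least one ALLOW match and no DENY match.

    Hypothetical helper; it covers document-level tokens only.
    """
    if not tokens:
        # A user with no tokens should probably see nothing; that decision
        # is left to the caller rather than emitting an empty clause.
        raise ValueError("no access tokens supplied")
    allow = " OR ".join(f"{allow_field}:{quote_token(t)}" for t in tokens)
    deny = " OR ".join(f"{deny_field}:{quote_token(t)}" for t in tokens)
    return f"({allow}) AND NOT ({deny})"


if __name__ == "__main__":
    print(build_access_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"]))
```

A string like this could be supplied as an fq parameter alongside the user's
own query. Note that, as written, documents carrying no ALLOW tokens at all
would be excluded, which may or may not be the desired behaviour.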


With regard to schema extension, I believe we need to be very careful here, as
requiring index-time storage of access control data will pose a problem for
any use cases where the access control needs to change (maybe often, maybe
only occasionally). I'm trying to think of a use case where this wouldn't at
least potentially be the case, and I can't think of one, but perhaps I'm not
truly understanding what exactly is stored in the __ALLOW_TOKEN__ and
__DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in
(e.g. let's say someone has left my organization, do I have to update
documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e.
would the same type/format of tokens be used for data from different sources
(e.g. NTFS files, network streams, NFS, web pages, etc.)? Will the tokens be
compatible with multiple and/or changing authorities (e.g. AD, Documentum,
LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the acl data. I admit, I've not had
enough time to look into how this might look at the moment, but it sounds
like it could be a good way to hold generic (authority-agnostic) acl data,
and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism,
please correct me. Would the __ALLOW_TOKEN__ et al fields store, for
example, SID information?


Thanks,
Peter



On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:

> Ok, not hearing back from Peter, I've done some Solr research and written
> some code that might work.  The approach I've taken is most similar to
> SOLR-1834, other than the LCF-centric logic.  Hopefully there will be a
> chance to try this out in a full end-to-end way on the weekend, after which
> I will submit it to the Solr team (where I think it most naturally would be
> built and delivered).
>
> What it's going to need is either a static or dynamic schema addition to
> define __ALLOW_TOKEN__document, __DENY_TOKEN__document,
> __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be
> string, multivalued fields (I think).  It would be great if these could be
> made a default part of Solr; similarly, it would be good if the new search
> component was predelivered with Solr and mentioned (even if commented out)
> in the example solrconfig.xml file.  The only other thing that needs to be
> done to hook up the search component is to include a configuration parameter
> describing the base URL of the LCF authority service.  Plus, as I said
> earlier, we still don't have a canned solution for authentication yet -
> although I feel that will be straightforward.
>
> Comments welcome...
> Karl
>
>
> ________________________________________
> From: Wright Karl (Nokia-S/Cambridge)
> Sent: Tuesday, April 27, 2010 8:20 AM
> To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
> Subject: RE: FW: Solr and LCF security at query time
>
> Hi Peter,
>
> I finally had a moment to review the SOLR-1872 and SOLR-1834 contributions
> in detail, and have a couple of SOLR-related questions.
>
> Both contributions rely on a SearchComponent to work their magic.  However,
> it also appears that each modifies the user query in a different way.  1834
> uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND
> and OR filterquery clauses.  Both of them are constructed using Solr
> FilterQuery objects.  Here are my questions:
>
> (1) I am not conversant enough with Solr yet to know the difference between
> the different kinds of clause structure.  Do you know if there is a
> difference?  For example, is there any possibility that AND/OR clauses can
> permit documents to be seen that should not be seen?  (MUST and MUST_NOT
> sound a lot more definite...)
>
> (2) Are Solr FilterQuery objects applied to constructing the query that
> will be sent to Lucene?  Or are they applied by Solr after-the-fact to the
> resultset?  Or, is it a combination of the two, depending on the details of
> your actual filter clause?
>
> I also haven't heard much from you in the last week or so - have you
> thought further about what you intend to do, and can you let me know whether
> you are still interested in developing an LCF plugin for Solr?
>
> Thanks,
> Karl
>
> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 12:23 PM
> To: dev@lucene.apache.org
> Cc: connectors-dev@incubator.apache.org;
> connectors-user@incubator.apache.org; lucene-dev@apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> See inline...
>
> On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:
>
> > Hi Peter,
> >
> > The authority connectors don't perform authentication at this time.
> > In fact, LCF has nothing to do with authentication at all - just
> > authorization.  The reason for this is that it is almost never the
> > case that somebody wants to provide multiple credentials in order to
> > be able to see their results.  Most enterprises who have multiple
> > repositories authenticate against AD and then map AD user names to
> > repository user names in order to access those repositories.  If you
> > read my earlier posts from this morning, you may have noted that I'm
> > looking at recommending JAAS plus Sun's krb5 login module for handling
> > the "authenticate against AD" case, which would cover some 95%+ of the
> > real-world authentication needed out there.
> >
> >
> I did read your earlier post regarding this, and I totally agree with you -
> this is best handled 'upstream'. In fact, I use a JAAS plugin in other
> places in the product (not Solr) for authentication.
>
>
> >
> > Yes, the idea is to store SIDs in solr at index time.  I don't know
> > enough about solr to know what kinds of issues this might entail, but
> > Lucene certainly has a model of metadata that's pretty flexible, so I
> > don't think this would be difficult at all.  Erik Hatcher also seemed
> > to confirm my suspicions that this would not be a problem.
> >
>
> It's certainly not a problem to store this data in Solr. The problem is
> more that you don't really *want* to store this data at index time.
> There are lots of reasons for not wanting to 'hard-code' SID data with
> documents in the index. Here's just a few:
>  * What happens if/when you want to add explicit user access to some [group
> of] documents ? (i.e. not via a group)
>  * What happens if you need to revoke or change a user's or group's access?
>  * It's difficult to move/replicate the index to another domain
>  * For AD, SIDs are generally not meant to be stored long term outside of
> AD, as they can be changed (this doesn't happen often, but it can happen
> after an AD rebuild, domain type upgrade, data recovery etc.)
>
> These and other scenarios mean re-indexing the stored data. When the index
> is huge, this is non-trivial (time-wise). There are not uncommon scenarios
> where user/group access control can change multiple times in one day.
>
> There might be a way of storing acl data in a payload or similar, but I'm
> not sure how that would work across millions of [arbitrarily grouped ]
> documents (I'm not familiar enough with payloads to know if this would be a
> good or bad idea).
>
>
> >
> > This is exactly why I think that we need to do the authentication
> > upstream of the authority world.
> >
> >
> Agreed.
>
>
>
> >
> > If Solr handles arbitrary document metadata, then I think we could
> > just use that feature.  But you know more about it than me, at this
> > point.  It would be great to get an overview of potential ways of doing
> > this.
> >
> >
> Payloads, maybe?
>
>
> >
> > For your particular task, it sounds like you are trying to read from
> > NTFS and apply security after-the-fact with some acl specification
> > file.  In that case, I'd write a repository connector that was based
> > on the file system connector (already part of the stable of connectors
> > for LCF) which reads ACL information from your acl.xml file.  Or, if
> > you prefer a UI for specifying ACL information, you could extend the
> > connector so that security is configured in the UI without having an
> > external acl.xml file at all - which would be a nice addition to the
> > existing file system connector.  (Repository connections and jobs are
> > configured internally in LCF by XML documents stored in the database,
> > so they can be arbitrarily structured.  I'm happy to help you figure
> > out how to do this if this is what you decide to do.)
> >
> For my particular requirements, there are no files - the data is generated
> from the network and stored. After the fact, there is no persistent
> location of this data other than in Solr.
>
> Storing the acl info using the connector sounds very interesting. Could be
> worth looking at in more detail. Thanks!
>
>
>
> > I think we still need to add in the authentication piece to make this
> > all work for you, so perhaps you can describe how you expect a user to
> > interact with your system, so I can understand your design issues.
> >
> > Thanks,
> > Karl
> >
> >
>
>
>
>
>
>
>
>
> > -----Original Message-----
> > From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > Sent: Thursday, April 22, 2010 11:32 AM
> > To: dev@lucene.apache.org
> > Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > Subject: Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for your detailed explanation - really good!
> >
> > As I've thought through some of the implications, I've added comments
> > below, so I hope they don't seem too jumbled...
> >
> > I suppose on the 'authority' side, it works kind of as I envisioned it
> > would.
> >
> > For general Solr access control, there are two layers of security that
> > need to be addressed:
> >  1. Authentication - make sure the incoming query is from a valid
> >     user, and the passed-in credentials (hash, certificate etc.) are
> >     correct
> >  2. Query filtering - potentially reduce the number/type of
> >     returned results based on the allow/deny metadata for the
> >     authenticated user
> >
> > I can see how the LCF auth connector works for 2., but can it do 1. as
> > well?
> > It would be good if this could somehow be integrated into any
> > container (Tomcat/Jetty et al) authentication that might be configured
> > (probably related to your previous post). In many ways, it could/should
> > be that the Authority (AD) part of the connector should only be
> > concerned with 1. and not 2. (see below).
> >
> > So, on the repository side, there is also an LCF connector that
> > 'closes the loop' to provide the 'what is it I'm trying to control' side
> > of things.
> > I understand that LCF doesn't do the mapping - it delegates this task
> > to the caller, but provides both sides of the equation (authority &
> > repository).
> >
> > >>>>>
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> > and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> > <<<<<
> > I think this is the bit that is worrying me - is this storing the SIDs
> > into Solr at document index time? This would be a problem for a whole
> > load of reasons, but maybe I'm missing something here? (see below for
> > a possible alternative)
> >
> > Basically, what I'm getting at here is that the allow/deny values need
> > to be stored in one of three places:
> >  1. In the authority (e.g. inside AD)
> >  2. In the document metadata (index-time)
> >  3. In external storage (e.g. acl.xml, NTFS etc.)
> >
> > 1. Extending AD is pretty much out, as this causes too many interop
> > problems.
> > 2. 'Hard-coding' acl information in the index makes it non-portable,
> > resistant to changes, etc.
> > 3. acl.xml is coupled with a Solr instance, but is easily
> > ported/replicated.
> > Storing/retrieving acl information from the source (e.g. NTFS) is
> > problematic, as the source may not be accessible (it may not even exist).
> >
> > I believe 3. or a variant is the way to go on the repo side, which
> > means the LCF Authority connector is mainly for Authentication (see
> > above), which is what you want from AD et al integration.
> > The problem that arises from 'pluggable' authentication is that, if
> > you're not using a certificate, you have to start with a password, but
> > the connector only has access to the password hash (unless the pwd is
> > sent in the query url). I don't know of a way to confirm identities in
> > AD using only the username and hash (AD does the hash compare). I
> > believe this is where container-based integration will likely work
> > better.
> >
> > So that I can confirm my understanding...a scenario might be like this:
> >
> > We have an AD connector that fetches the SIDs and we can read them etc.
> > For my environment, where there are no 'files' (there's only a
> > transient network stream), we have an LCF 'Solr Field Filter Query'
> > connector that decides which Filter Queries to apply (allow and deny)
> > for the passed in SID(s).
> >
> > For another environment, let's say, NTFS, there might be an 'NTFS'
> > connector that would provide some kind of mapping of files/folders to
> > SID(s). Since Solr wouldn't intrinsically know about this, the acl
> > information would need to be stored somewhere in the index. This would
> > mean extending the Solr schema and storing metadata at index time.
> > The alternative is to re-use the 'Solr Field Filter Query' connector
> > for this as well (and any other document types that might be read in).
> > This keeps the index 'clean' of acl-specific metadata, and allows for
> > in-place changes and easy cross-document/index/instance access control.
> >
> >
> > If the above interpretation is [roughly] correct (please let me know
> > if I've got this wrong!), this would reduce down to having:
> >   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> > (possibly/partly at the container level)
> >   2. At least an LCF Repository connector for 'acl.xml'
> >   3. Optional other LCF Repository connectors
> >
> > It sounds like you've now finished the first half of 1. by adding the
> > ability to get the required auth data from a Solr api call. The other
> > half of 1. will be implementing the LCF interface in the
> > SolrACLSecurity class, to effectively replace the 'user', 'group' and
> > 'password' bits of acl.xml.
> >
> > Does the above sound like an accurate interpretation? Just trying to
> > get a good picture of what work needs doing, where it goes, etc.
> >
> > Many thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
> >
> > >  >>>>>>
> > > What is the relationship between stored data (documents) and
> > > authorities' access/deny attributes? (do you have any examples of
> > > what an access_token value might contain?)
> > > <<<<<<
> > >
> > > Documents have access/deny attributes; authorities simply provide
> > > the list of tokens that belong to an authenticated user.  Thus,
> > > there's no access/deny for an authority; that's attached to the
> > > document (as it is in real-world repositories).
> > >
> > > Let's run a quick example, using Active Directory and a Windows file
> > > system.  Suppose that you have a directory with documents in it,
> > > call it DirectoryA, and the directory allows read access to the
> > > following SIDs:
> > >
> > > S-123-456-76890
> > > S-23-64-12345
> > >
> > > These SIDs correspond to active directory groups, let's call them
> > > Group1 and Group2, respectively.
> > >
> > > DirectoryB also has documents in it, and those documents have just
> > > the SID S-123-456-76890 attached, because only Group1 can read its
> > > contents.
> > >
> > > Now, pretend that someone has created an LCF Active Directory
> > > authority connection (in the LCF UI), which is called "myAD", and
> > > this connection is set up to talk to the governing AD domain
> > > controller for this Windows file system.  We now know enough to
> > > describe the document indexing process:
> > >
> > > - Each file in DirectoryA will have the following
> > > __ALLOW_TOKEN__document attributes inside Solr:
> > > "myAD:S-123-456-76890" and "myAD:S-23-64-12345".
> > > - Each file in DirectoryB will have the following
> > > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890".
> > >
> > > Now, suppose that a user (let's call him "Peter") is authenticated
> > > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > > are (say):
> > >
> > > S-1-1-0 (the 'everyone' SID)
> > > S-323-999-12345 (his own personal user SID)
> > > S-23-64-12345 (the SID he gets because he belongs to group 2)
> > >
> > > We want to look up the documents in the search index that he can see.
> > > So, we ask the LCF authority service what his tokens are, and we get
> > > back:
> > >
> > > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> > >
> > > The documents we should return in his search are the ones that match
> > > his search criteria, have at least one ALLOW token in common with his
> > > tokens, and have no DENY token in common with his tokens (there
> > > aren't any DENY tokens involved in this example).  So only files that
> > > have one of his three tokens as an ALLOW attribute would be returned.
> > >
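
The rule in the example above can be written down as a small set check. This
is only a sketch: the function names are invented for illustration, the
"authority:SID" token format simply mirrors the example (real LCF tokens
should be treated as opaque), and a real deployment would push the check into
the query rather than test documents one at a time.

```python
def qualify(authority, sids):
    """Prefix each SID with the authority connection name, as in the example."""
    return {f"{authority}:{sid}" for sid in sids}


def is_visible(user_tokens, allow_tokens, deny_tokens=frozenset()):
    """A document is visible if the user holds at least one of its ALLOW
    tokens and none of its DENY tokens."""
    user = set(user_tokens)
    has_allow = bool(user & set(allow_tokens))
    has_deny = bool(user & set(deny_tokens))
    return has_allow and not has_deny


# ALLOW tokens attached to the files in each directory at index time.
directory_a = qualify("myAD", ["S-123-456-76890", "S-23-64-12345"])
directory_b = qualify("myAD", ["S-123-456-76890"])

# Tokens the authority service returns for the authenticated user "Peter".
peter = qualify("myAD", ["S-1-1-0", "S-323-999-12345", "S-23-64-12345"])

if __name__ == "__main__":
    print(is_visible(peter, directory_a))  # True: shares the Group2 token
    print(is_visible(peter, directory_b))  # False: no token in common
```
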
> > > Note that what we are attempting to do is enforce AD's security with
> > > the search results we present.  There is no need to define a whole
> > > new security mechanism, because AD already has one that people use.
> > >
> > > >>>>>>
> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > > to ensure there are no security or other dependencies of indexed
> > > data with any external repository - most notably the file system.
> > > There are many reasons for wanting this, but one of the main ones is
> > > that Solr-stored data is not always based on file data (or
> > > accessible file data).
> > > In fact, in my particular case, almost none of the indexed data
> > > comes from files.
> > > <<<<<<
> > >
> > > LCF is all about abstracting from repositories.  It's not
> > > specifically about a file system, although that is a convenient
> > > example.  If you are building your own kind of repository with your
> > > own security setup, that's fine - but in the LCF world you'd need to
> > > create an authority connector for your repository (which maybe reads
> > > your acl.xml file), as well as a repository connector (which hands
> > > documents to LCF and provides it with the access tokens that make
> > > security work).  Of course, you can do something much lighter that
> > > doesn't include LCF at all if you are just integrating a custom
> > > repository of your own, but it sounded like you were interested in the
> > > broader problem here.
> > >
> > > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > > connectors to work cooperatively to define access tokens in a way
> > > that is consistent from authority connector to repository connector
> > > for a given repository kind.  Anybody can write a connector, so the
> > > beauty of all this is that you can build a system where data from
> > > many disparate sources is indexed, and security for each is
> > > simultaneously enforced.
> > >
> > > Karl
> > >
> > >
> > >  ------------------------------
> > > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > > *Sent:* Thursday, April 22, 2010 9:24 AM
> > >
> > > *To:* dev@lucene.apache.org
> > > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > > connectors-dev@incubator.apache.org
> > > *Subject:* Re: FW: Solr and LCF security at query time
> > >
> > > Hi Karl,
> > >
> > > Thanks very much for the diagram -
> > > Sorry about all the questions, but this raises a few new ones...
> > >
> > > What is the relationship between stored data (documents) and
> > > authorities' access/deny attributes? (do you have any examples of
> > > what an access_token value might contain?)
> > >
> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > > to ensure there are no security or other dependencies of indexed
> > > data with any external repository - most notably the file system.
> > > There are many reasons for wanting this, but one of the main ones is
> > > that Solr-stored data is not always based on file data (or
> > > accessible file data).
> > > In fact, in my particular case, almost none of the indexed data
> > > comes from files.
> > >
> > > This is one reason why SOLR-1872 uses filter queries for its
> > > access/deny tokens - so that all the required information for access
> > > control completely resides within the Solr index itself.
> > > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > > and users, some external 'repository' (files) and users, or
> > > arbitrary data (e.g. either of these)?
> > >
> > > I hope that makes sense...
> > >
> > > Thanks!
> > > Peter
> > >
> > >
> > >
> > >
> > > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> > >
> > >> Hi Peter,
> > >>
> > >> I've attached a diagram that is not in the wiki as of yet, and I'll
> > >> try to answer your questions.
> > >>
> > >> >>>>>>
> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been
> > >> stored for a particular user in the underlying acl store (e.g.
> > >> Active Directory)?
> > >> How does AD and/or LCF handle storing such data in its schema?
> > >> (does AD need its schema extended?) Presumably, any such AD fields
> > >> would need to be queried for effective rights in order to cater for
> > >> group membership allows and denies.
> > >> <<<<<<
> > >>
> > >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> > >> strings that represent a contract between an LCF authority
> > >> connection and the LCF repository connection that picks up the
> > >> documents (from wherever).  These tokens thus have no real meaning
> > >> outside of LCF.  You must regard them as opaque.
> > >>
> > >> The contract, however, states that if you use the LCF authority
> > >> service to obtain tokens for an authenticated user, you will get
> > >> back a set that is CONSISTENT with the tokens that were attached to
> > >> the documents LCF sent to Solr for indexing in the first place.
> > >> So, you don't have to worry about it, and that's kind of the idea.
> > >> So you imagine the following flow:
> > >>
> > >> (1) Use LCF to fetch documents and send them to Solr
> > >> (2) When searching, use the LCF authority service to get the
> > >> desired user's access tokens
> > >> (3) Either filter the results, or modify the query, to be sure the
> > >> access tokens all match up properly
> > >>
> > >> For the AD authority, the LCF access tokens consist, in part, of
> > >> the user's SIDs.  For other authorities, the access tokens are
> > >> wildly different.  You really don't want to know what's in them,
> > >> since that's the job of the LCF authority to determine. ;-)
> > >>
> > >> LCF is not, by the way, joined at the hip with AD.  However, in
> > >> practice, most enterprises in the world use some form of AD single
> > >> signon for their web applications, and even if they're using some
> > >> repository with its own idea of security, there's a mapping between
> > >> the AD users and the repository's users.  Doing that mapping is
> > >> also the job of the LCF authority for that repository.
> > >>
> > >> Hope this helps.  Also, I'm not expecting time miracles here, so
> > >> don't sweat the schedule.
> > >>
> > >>
> > >> Karl
> > >>
> > >>
> > >> ________________________________________
> > >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> > >> Sent: Thursday, April 22, 2010 4:27 AM
> > >> To: dev@lucene.apache.org
> > >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > >> connectors-dev@incubator.apache.org
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> Hi Karl,
> > >>
> > >> Thanks for the quick turnaround.
> > >> I'm in the middle of a product release for us, so I fear I won't be
> > >> as quick as you... :-)
> > >>
> > >> I couldn't find a simple flow diagram or similar for LCF with
> > >> regard to security (probably looking in the wrong place).
> > >> Perhaps you could help on these questions...?
> > >>
> > >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> > >> sub-queries, which are then used as filter queries in a user's search.
> > >>
> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> > >> stored for a particular user in the underlying acl store (e.g.
> > >> Active Directory)?
> > >> How does AD and/or LCF handle storing such data in its schema?
> > >> (does AD need its schema extended?) Presumably, any such AD fields
> > >> would need to be queried for effective rights in order to cater for
> > >> group membership allows and denies.
> > >>
> > >> I guess I'm just trying to understand the architectural
> > >> flow/storage/retrieval of data in the various parts of the system,
> > >> but I admit, I need to do more research on this.
> > >> After our product release, when I get a few more spare cycles, I
> > >> can look at it in more detail.
> > >>
> > >> Many thanks!
> > >> Peter
> > >>
> > >>
> > >>
> > >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
> > >> karl.wright@nokia.com>> wrote:
> > >> Hi Peter,
> > >>
> > >> I just committed the promised changes to the LCF Solr output
> > >> connector.
> > >>
> > >> ACL metadata will now be posted to the Solr Http interface along
> > >> with the document as the two following fields:
> > >>
> > >> __ACCESS_TOKEN__document
> > >> __DENY_TOKEN__document
> > >>
> > >> There will, of course, potentially be multiple values for each of
> > >> these two fields.
> > >>
> > >> Hope this helps,
> > >> Karl
> > >>
> > >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> > >> peter.sturge@googlemail.com>]
> > >> Sent: Tuesday, April 20, 2010 6:51 PM
> > >>
> > >> To: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> Hi Karl,
> > >>
> > >> Thanks for the info. I'll have a look at the link and try to take
> > >> in as much sugar as my insulin levels will handle...
> > >> It sounds like the necessary interface(s) are already in LCF - just
> > >> a matter of implementing them in the Solr 1872 plugin.
> > >> I'll need to digest the LCF stuff to get to grips with it..please
> > >> bear with me while I do that...
> > >>
> > >> When you say:
> > >>   The LCF solr output connection doesn't yet do this, but it is
> > >> trivial for me to make that happen.
> > >> Do you mean a mechanism by which solr.war can get url et al info
> > >> from its parent container (Tomcat, Jetty etc.), or have I
> > >> misinterpreted this?
> > >>
> > >>
> > >> Thanks,
> > >> Peter
> > >>
> > >>
> > >>
> > >>
> > >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
> > >> karl.wright@nokia.com>> wrote:
> > >> Hi Peter,
> > >>
> > >> I'm the principal committer for LCF, but I don't know as much about
> > >> Solr as I ought to, so it sounds like a potentially productive
> > >> collaboration.
> > >>
> > >> LCF does exactly what you are looking for - the only issue at all
> > >> is that you need to fetch a URL from a webapp to get what you are
> > >> looking for.  The "plugs" are all inside LCF for different kinds of
> > >> repositories.  Here's a link that might help with drinking the LCF
> > >> "koolaid", as it were:
> > >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> > >>
> > >> The url would be something like this (on a locally installed
> > >> tomcat-based LCF instance):
> > >>
> > >>
> > >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> > >>
> > >> ... and this fetch returns something like:
> > >>
> > >> TOKEN:xxxxxxx
> > >> TOKEN:yyyyyyy
> > >> TOKEN:zzzzzzz
> > >> ....
> > >>
> > >> ... which represent the amalgamated tokens for all of the defined
> > >> authorities, and by some strange coincidence ( ;-) ) are compatible
> > >> with certain pieces of metadata that have been passed into Solr
> > >> with each document - one set of Allow tokens, and a second set of
> > >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> > >> but it is trivial for me to make that happen.
> > >>
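The fetch described above can be sketched in Python roughly as follows. This is only an illustration under assumptions: it assumes the response really is one `TOKEN:<value>` entry per line, as in the sample, and the helper names `build_user_acls_url` and `parse_user_acls` are hypothetical. Only the `UserACLs` path and the `username` parameter come from the message.

```python
from urllib.parse import urlencode


def build_user_acls_url(base_url, username):
    """Build the authority-service URL for one user (URL shape taken from the message)."""
    # urlencode percent-encodes the '@' in the user name.
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_user_acls(body):
    """Collect the values of all 'TOKEN:<value>' lines; any other line is ignored."""
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


if __name__ == "__main__":
    print(build_user_acls_url("http://localhost:8080/lcf-authority-service",
                              "someusername@somedomain.com"))
    print(parse_user_acls("TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"))
```

The tokens are kept as opaque strings here, since nothing outside LCF is supposed to interpret them.
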
> > >> Does this sound plausible to you?
> > >>
> > >> Karl
> > >>
> > >>
> > >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> > >> peter.sturge@googlemail.com>]
> > >> Sent: Tuesday, April 20, 2010 5:41 PM
> > >> To: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
> > >> dev@lucene.apache.org>
> > >>
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> Hi Karl,
> > >>
> > >> Integrating LCF to get external token support for SOLR-1872 sounds
> > >> very interesting indeed. I don't know anything about LCF, but one
> > >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> > >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> > >> series of plugins that could be used for obtaining back-end
> > >> authentication information.
> > >>
> > >> If you're good with LCF, perhaps we could work together to build
> > >> this in.
> > >> One of the first things would be defining an interface that would
> > >> be as easy as possible to plug LCF into. Have you any
> > >> suggestions/insight on this front?
> > >>
> > >> Many thanks,
> > >> Peter
> > >>
> > >>
> > >>
> > >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
> > >> karl.wright@nokia.com>> wrote:
> > >> SOLR-1872 looks exactly like what I was envisioning, from the
> > >> search query perspective, although instead of the acl xml file you
> > >> specify, LCF stipulates that you would dynamically query the
> > >> lcf-authority-service servlet for the access tokens themselves.
> > >> That would get you support for AD, Documentum, LiveLink, Meridio,
> > >> and Memex for free. It seems likely that this component could be
> > >> modified to work with LCF with minor effort.
> > >>
> > >> The missing component still seems to be AD authentication, which
> > >> needs a solution.
> > >>
> > >> Karl
> > >>
> > >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> > >> peter.sturge@googlemail.com>]
> > >> Sent: Tuesday, April 20, 2010 10:44 AM
> > >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> If you want to do this completely within Solr, have a look at:
> > >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> > >>
> > >> Thanks,
> > >> Peter
> > >>
> > >>
> > >>
> > >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
> > >> karl.wright@nokia.com>> wrote:
> > >> FYI
> > >>
> > >> ________________________________
> > >> From: Wright Karl (Nokia-S/Cambridge)
> > >> Sent: Tuesday, April 20, 2010 8:16 AM
> > >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
> > >> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
> > >> connectors-dev@incubator.apache.org<mailto:
> > >> connectors-dev@incubator.apache.org>'; '
> > >> connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>'
> > >> Subject: RE: Solr and LCF security at query time
> > >>
> > >> Dominique,
> > >>
> > >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> > >> establishes a powerful multi-repository security model, even though
> > >> it doesn't yet do the final step of enforcing that model at the
> > >> search end.  LCF allows you to define multiple authorities to
> > >> operate against disparate repositories, and use the appropriate
> > >> authority to secure any given document.  The solr people are aware
> > >> of this design, which addresses the issues raised by SOLR-1834 very
> > >> nicely.  However, as I said before, time is a problem, and the work
> > >> still needs to be done.
> > >>
> > >> I suggest you read up on the actual security model of LCF, and
> > >> perhaps experiment with that and the SOLR-1834 contribution, to see
> > >> if there is common ground.  One thing we've learned at MetaCarta is
> > >> that post-filtering for security purposes is expensive, and it is
> > >> better to modify the queries themselves to restrict the results, if
> > >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> > >> sounds like it might be the filtering approach.  Still, it would be
> > >> better than nothing.
> > >>
> > >> Please let me know what you find out.
> > >>
> > >> Thanks,
> > >> Karl
> > >>
> > >> ________________________________
> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
> > >> dominique.bejean@eolya.fr>]
> > >> Sent: Tuesday, April 20, 2010 8:03 AM
> > >> To: Wright Karl (Nokia-S/Cambridge)
> > >> Cc: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>;
> > >> connectors-dev@incubator.apache.org<mailto:
> > >> connectors-dev@incubator.apache.org>
> > >> Subject: Re: Solr and LCF security at query time
> > >>
> > >> Karl,
> > >>
> > >> Thank you for your reply.
> > >>
> > >> I did some research today and found this:
> > >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> > >> http://demo.findwise.se:8880/SolrSecurity/
> > >>
> > >> The Solr security model has to be able to filter a result list with
> > >> items coming from various sources at the same time (livelink,
> > >> documentum, file system, ...). Big subject :)
> > >>
> > >> Dominique
> > >>
> > >>
> > >> On 20/04/10 13:34,
> > >> karl.wright@nokia.com<ma...@nokia.com> wrote:
> > >> Hi Dominique,
> > >>
> > >> At the moment, in order to enforce the LCF security model within
> > >> Lucene/Solr, you will need to build this functionality into
> > >> whatever client you are using to display the Lucene search results.
> > >> Specifically, you would need to take the following steps:
> > >>
> > >> (1) Have your users access your search client through Apache.
> > >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> > >> mod_authz_annotate, to cause authorization HTTP headers to be
> > >> transmitted to the client webapp.
> > >> (3) Have your client webapp alter whatever queries it is doing, to
> > >> add an appropriate query clause for each of the access tokens
> > >> transmitted in the headers.
> > >>
> > >> (This is how it is done at MetaCarta.)
> > >>
> > >> Alternatively, you may find a way to do this completely with a web
> > >> application under a Java app server such as Tomcat.  I have not yet
> > >> done the research to find out whether this is a feasible alternative.
> > >> Effectively, what you need something like mod_auth_kerb to do is to
> > >> authenticate your user against Active Directory, or whoever the
> > >> authenticator ought to be.  JAAS may be helpful here.
> > >>
> > >> There are, of course, intentions to fill out the missing pieces
> > >> more completely and transparently via a Solr search plugin and/or
> > >> filter.  What has been lacking is time.  If you are in a position
> > >> to do development in this area, we're happy to have any assistance
> > >> you might provide.
> > >>
> > >> Thanks,
> > >> Karl
> > >> ________________________________
> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> > >> Sent: Tuesday, April 20, 2010 5:06 AM
> > >> To: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>
> > >>  Subject: Solr and LCF security at query time
> > >>
> > >> Hi,
> > >>
> > >> I don't see in the LCF wiki how Solr and LCF work together at query
> > >> time in order to remove from the result list the items the user is
> > >> not allowed to access.
> > >>
> > >> In
> > >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> > >> I just see these sentences:
> > >>
> > >> "Once all these documents and their access tokens are handed to
> > >> the search engine, it is the search engine's job to enforce
> > >> security by excluding inappropriate documents from the search
> > >> results. For Lucene, this infrastructure is expected to be built on
> > >> top of Lucene's generic metadata abilities, but has not been
> > >> implemented at this time."
> > >>
> > >> I am not sure I understand. Does this mean that for the moment, it
> > >> is not possible for Solr to apply security by using an Authority
> > >> Connector?
> > >>
> > >> Dominique
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -------------------------------------------------------------------
> > >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
> > >> additional commands, e-mail: dev-help@lucene.apache.org
> > >>
> > >
> > >
> >
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the
middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses,
although the point at which this is done is slightly different.

I think I am correct in my understanding that with filter queries, the
results are filtered 'post-Lucene', but are separately (Solr) cached, so you
get a hit on the first search, but then benefit from cached hits on
subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are
applied at the Lucene query directly, so don't have separate Solr caching.
I've not benchmarked the two, so one or other might be slower/faster for
various search scenarios.

In any case, I believe either technique can be employed in either 1834 or
1872.
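
For illustration, here is a minimal Python sketch of the filter-query flavour of this: it turns a user's tokens into a single Lucene-syntax filter string that could be passed as an `fq` parameter. It is a sketch under assumptions, not code from either SOLR-1834 or SOLR-1872. The function names are hypothetical, and the default field names are the ones Karl proposes elsewhere in this thread (`__ALLOW_TOKEN__document`, `__DENY_TOKEN__document`).

```python
def quote_term(value):
    """Wrap a token in double quotes so characters such as ':' are taken literally."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_acl_filter(tokens,
                     allow_field="__ALLOW_TOKEN__document",
                     deny_field="__DENY_TOKEN__document"):
    """Require a match on the allow field and prohibit a match on the deny field."""
    if not tokens:
        # No tokens means no access; the caller must handle this case explicitly.
        raise ValueError("user has no access tokens")
    terms = " OR ".join(quote_term(t) for t in tokens)
    return f"+{allow_field}:({terms}) -{deny_field}:({terms})"


if __name__ == "__main__":
    print(build_acl_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"]))
```

The `+` and `-` prefixes are the query-syntax spelling of the MUST and MUST_NOT clauses, so the same string expresses either technique; the difference is only whether it is sent as a filter query or merged into the main query.
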


With regard to schema extension, I believe we need to be very careful here, as
requiring index-time storage of access control data will pose a problem for
any use cases where the access control needs to change (maybe often, maybe
only occasionally). I'm trying to think of a use case where this wouldn't at
least potentially be the case, and I can't think of one, but perhaps I'm not
truly understanding what exactly is stored in the __ALLOW_TOKEN__ and
__DENY_TOKEN__ fields, and how/where subsequent acl changes would fit in
(e.g. let's say someone has left my organization, do I have to update
documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e.
would the same type/format of tokens be used for data from different sources
(e.g. NTFS files, network streams, NFS, web pages, etc.)? Will the tokens be
compatible with multiple and/or changing authorities (e.g. AD, Documentum,
LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the acl data. I admit, I've not had
enough time to look into how this might look at the moment, but it sounds
like it could be a good way to hold generic (authority-agnostic) acl data,
and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism,
please correct me. Would the __ALLOW_TOKEN__ et al fields store, for
example, SID information?


Thanks,
Peter



On Tue, Apr 27, 2010 at 10:21 PM, <ka...@nokia.com> wrote:

> Ok, not hearing back from Peter, I've done some Solr research and written
> some code that might work.  The approach I've taken is most similar to SOLR
> 1834, other than the LCF-centric logic.  Hopefully there will be a chance to
> try this out in a full end-to-end way  on the weekend, after which I will
> submit it to the Solr team (where I think it most naturally would be built
> and delivered).
>
> What it's going to need is either a static or dynamic schema addition to
> define __ALLOW_TOKEN__document, __DENY_TOKEN__document,
> __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be
> string, multivalued fields (I think).  It would be great if these could be
> made a default part of Solr; similarly, it would be good if the new search
> component was predelivered with Solr and mentioned (even if commented out)
> in the example solrconfig.xml file.  The only other thing that needs to be
> done to hook up the search component is to include a configuration parameter
> describing the base URL of the LCF authority service.  Plus, as I said
> earlier, we still don't have a canned solution for authentication yet -
> although I feel that will be straightforward.
>
> Comments welcome...
> Karl
>
>
> ________________________________________
> From: Wright Karl (Nokia-S/Cambridge)
> Sent: Tuesday, April 27, 2010 8:20 AM
> To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
> Subject: RE: FW: Solr and LCF security at query time
>
> Hi Peter,
>
> I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions
> in detail, and have a couple of SOLR-related questions.
>
> Both contributions rely on a SearchComponent to work their magic.  However,
> it also appears that each modifies the user query in a different way.  1834
> uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND
> and OR filterquery clauses.  Both of them are constructed using Solr
> FilterQuery objects.  Here are my questions:
>
> (1) I am not conversant enough with Solr yet to know the difference between
> the different kinds of clause structure.  Do you know if there is a
> difference?  For example, is there any possibility that AND/OR clauses can
> permit documents to be seen that should not be seen?  (MUST and MUST_NOT
> sound a lot more definite...)
>
> (2) Are Solr FilterQuery objects applied to constructing the query that
> will be sent to Lucene?  Or are they applied by Solr after-the-fact to the
> resultset?  Or, is it a combination of the two, depending on the details of
> your actual filter clause?
>
> I also haven't heard much from you in the last week or so - have you
> thought further about what you intend to do, and can you let me know whether
> you are still interested in developing an LCF plugin for Solr?
>
> Thanks,
> Karl
>
> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 12:23 PM
> To: dev@lucene.apache.org
> Cc: connectors-dev@incubator.apache.org;
> connectors-user@incubator.apache.org; lucene-dev@apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> See inline...
>
> On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:
>
> > Hi Peter,
> >
> > The authority connectors don't perform authentication at this time.
> > In fact, LCF has nothing to do with authentication at all - just
> > authorization.  The reason for this is that it is almost never the
> > case that somebody wants to provide multiple credentials in order to
> > be able to see their results.  Most enterprises who have multiple
> > repositories authenticate against AD and then map AD user names to
> > repository user names in order to access those repositories.  If you
> > read my earlier posts from this morning, you may have noted that I'm
> > looking at recommending JAAS plus Sun's Krb5 login module for
> > handling the "authenticate against AD" case, which would cover some
> > 95%+ of the real world authentication needed out there.
> >
> >
> I did read your earlier post regarding this, and I totally agree with you -
> this is best handled 'upstream'. In fact, I use a JAAS plugin in other
> places in the product (not Solr) for authentication.
>
>
> >
> > Yes, the idea is to store SIDs in solr at index time.  I don't know
> > enough about solr to know what kinds of issues this might entail, but
> > Lucene certainly has a model of metadata that's pretty flexible, so I
> > don't think this would be difficult at all.  Eric Hatcher also seemed
> > to confirm my suspicions that this would not be a problem.
> >
>
> It's certainly not a problem to store this data in Solr. The problem is
> more that you don't really *want* to store this data at index time.
> There are lots of reasons for not wanting to 'hard-code' SID data with
> documents in the index. Here's just a few:
>  * What happens if/when you want to add explicit user access to some [group
> of] documents ? (i.e. not via a group)
>  * What happens if you need to revoke or change a user's or group's access?
>  * It's difficult to move/replicate the index to another domain
>  * For AD, SIDs are generally not meant to be stored long term outside of
> AD, as they can be changed (this doesn't happen often, but it can happen
> after an AD rebuild, domain type upgrade, data recovery etc.)
>
> These and other scenarios mean re-indexing the stored data. When the index
> is huge, this is non-trivial (time-wise). It is not uncommon for
> user/group access control to change multiple times in one day.
>
> There might be a way of storing acl data in a payload or similar, but I'm
> not sure how that would work across millions of [arbitrarily grouped ]
> documents (I'm not familiar enough with payloads to know if this would be a
> good or bad idea).
>
>
> >
> > This is exactly why I think that we need to do the authentication
> > upstream of the authority world.
> >
> >
> Agreed.
>
>
>
> >
> > If Solr handles arbitrary document metadata, then I think we could
> > just use that feature.  But you know more about it than me, at this
> > point.  It would be great to get an overview of potential ways of
> > doing this.
> >
> >
> Payloads, maybe?
>
>
> >
> > For your particular task, it sounds like you are trying to read from
> > NTFS and apply security after-the-fact with some acl specification
> > file.  In that case, I'd write a repository connector that was based
> > on the file system connector (already part of the stable of connectors
> > for LCF) which reads ACL information from your acl.xml file.  Or, if
> > you prefer a UI for specifying ACL information, you could extend the
> > connector so that security is configured in the UI without having an
> > external acl.xml file at all - which would be a nice addition to the
> > existing file system connector.  (Repository connections and jobs are
> > configured internally in LCF by XML documents stored in the database,
> > so they can be arbitrarily structured.  I'm happy to help you figure
> > out how to do this if this is what you decide to do.)
> >
> For my particular requirements, there are no files - the data is
> generated from the network and stored. After the fact, there is no
> persistent location of this data other than in Solr.
>
> Storing the acl info using the connector sounds very interesting. Could be
> worth looking at in more detail. Thanks!
>
>
>
> > I think we still need to add in the authentication piece to make this
> > all work for you, so perhaps you can describe how you expect a user to
> > interact with your system, so I can understand your design issues.
> >
> > Thanks,
> > Karl
> >
> >
>
>
>
>
>
>
>
>
> > -----Original Message-----
> > From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > Sent: Thursday, April 22, 2010 11:32 AM
> > To: dev@lucene.apache.org
> > Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > Subject: Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for your detailed explanation - really good!
> >
> > As I've thought through some of the implications, I've added comments
> > below, so I hope they don't seem too jumbled...
> >
> > I suppose on the 'authority' side, it works kind of as I envisioned it
> > would.
> >
> > For general Solr access control, there are two layers of security that
> > need to be addressed:
> >  1. Authentication - make sure the incoming query is from a valid
> > user, and the passed-in credentials (hash, certificate etc.) are
> > correct
> >  2. Query filtering - potentially reduce the number/type of
> > returned results based on the allow/deny metadata for the
> > authenticated user
> >
> > I can see how the LCF auth connector works for 2., but can it do 1. as
> > well?
> > It would be good if this could somehow be integrated into any
> > container (Tomcat/Jetty et al) authentication that might be configured
> > (probably related to your previous post). In many ways, it could/should
> > be that the Authority (AD) part of the connector should only be
> > concerned with 1. and not 2. (see below).
> >
> > So, on the repository side, there is also an LCF connector that
> > 'closes the loop' to provide the 'what is it I'm trying to control'
> > side of things.
> > I understand that LCF doesn't do the mapping - it delegates this task
> > to the caller, but provides both sides of the equation (authority &
> > repository).
> >
> > >>>>>
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890",
> > and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> > <<<<<
> > I think this is the bit that is worrying me - is this storing the SIDs
> > into Solr at document index time? This would be a problem for a whole
> > load of reasons, but maybe I'm missing something here? (see below for
> > a possible alternative)
> >
> > Basically, what I'm getting at here is that the allow/deny values need
> > to be stored in one of three places:
> >  1. In the authority (e.g. inside AD)
> >  2. In the document metadata (index-time)
> >  3. In external storage (e.g. acl.xml, NTFS etc.)
> >
> > 1. Extending AD is pretty much out, as this causes too many interop
> > problems
> > 2. 'Hard-coding' acl information in the index makes it
> > non-portable, resistant to changes, etc.
> > 3. acl.xml is coupled with a Solr instance, but is easily
> > ported/replicated.
> > Storing/retrieving acl information from the source (e.g. NTFS) is
> > problematic, as the source may not be accessible (it may not even exist).
> >
> > I believe 3. or a variant is the way to go on the repo side, which
> > means the LCF Authority connector is mainly for Authentication (see
> > above), which is what you want from AD et al integration.
> > The problem that arises from 'pluggable' authentication is that, if
> > you're not using a certificate, you have to start with a password, but
> > the connector only has access to the password hash (unless the pwd is
> > sent in the query url). I don't know of a way to confirm identities in
> > AD using only the username and hash (AD does the hash compare). I
> > believe this is where container-based integration will likely work
> > better.
> >
> > So that I can confirm my understanding...a scenario might be like this:
> >
> > We have an AD connector that fetches the SIDs and we can read them etc.
> > For my environment, where there are no 'files' (there's only a
> > transient network stream), we have an LCF 'Solr Field Filter Query'
> > connector that decides which Filter Queries to apply (allow and deny)
> > for the passed in SID(s).
> >
> > For another environment, let's say, NTFS, there might be an 'NTFS'
> > connector that would provide some kind of mapping of files/folders to
> > SID(s). Since Solr wouldn't intrinsically know about this, the acl
> > information would need to be stored somewhere in the index. This would
> > mean extending the Solr schema and storing metadata at index time.
> > The alternative is to re-use the 'Solr Field Filter Query' connector
> > for this as well (and any other document types that might be read in).
> > This keeps the index 'clean' of acl-specific metadata, and allows for
> > in-place changes and easy cross-document/index/instance access control.
> >
> >
> > If the above interpretation is [roughly] correct (please let me know
> > if I've got this wrong!), this would reduce down to having:
> >   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> > (possibly/partly at the container level)
> >   2. At least an LCF Repository connector for 'acl.xml'
> >   3. Optional other LCF Repository connectors
> >
> > It sounds like you've now finished the first half of 1. by adding the
> > ability to get the required auth data from a Solr api call. The other
> > half of 1. will be implementing the LCF interface in the
> > SolrACLSecurity class, to effectively replace the 'user', 'group' and
> > 'password' bits of acl.xml.
> >
> > Does the above sound like an accurate interpretation? Just trying to
> > get a good picture of what work needs doing, where it goes, etc.
> >
> > Many thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
> >
> > >  >>>>>>
> > > What is the relationship between stored data (documents) and
> > > authorities' access/deny attributes? (do you have any examples of
> > > what an access_token value might contain?)
> > > <<<<<<
> > >
> > > Documents have access/deny attributes; authorities simply provide
> > > the list of tokens that belong to an authenticated user.  Thus,
> > > there's no access/deny for an authority; that's attached to the
> > > document (as it is in real-world repositories).
> > >
> > > Let's run a quick example, using Active Directory and a Windows file
> > > system.  Suppose that you have a directory with documents in it,
> > > call it DirectoryA, and the directory allows read access to the
> > > following SIDs:
> > >
> > > S-123-456-76890
> > > S-23-64-12345
> > >
> > > These SIDs correspond to active directory groups, let's call them
> > > Group1 and Group2, respectively.
> > >
> > > DirectoryB also has documents in it, and those documents have just
> > > the SID S-123-456-76890 attached, because only Group1 can read its
> > > contents.
> > >
> > > Now, pretend that someone has created an LCF Active Directory
> > > authority connection (in the LCF UI), which is called "myAD", and
> > > this connection is set up to talk to the governing AD domain
> > > controller for this Windows file system.  We now know enough to
> > > describe the document indexing process:
> > >
> > > > - Each file in DirectoryA will have the following
> > > > __ALLOW_TOKEN__document attributes inside Solr:
> > > > "myAD:S-123-456-76890" and "myAD:S-23-64-12345".
> > > > - Each file in DirectoryB will have the following
> > > > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> > >
> > > Now, suppose that a user (let's call him "Peter") is authenticated
> > > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > > are
> > (say):
> > >
> > > S-1-1-0 (the 'everyone' SID)
> > > S-323-999-12345 (his own personal user SID)
> > > S-23-64-12345 (the SID he gets because he belongs to group 2)
> > >
> > > We want to look up the documents in the search index that he can see.
> > > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> > >
> > > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> > >
> > > The documents we should return in his search are the ones matching
> > > his search criteria, PLUS the intersection of his tokens with the
> > > document ALLOW tokens, MINUS the intersection of his tokens with the
> > > document DENY tokens (there aren't any involved in this example).
> > > So only files that have one of his three tokens as an ALLOW
> > > attribute would be
> > returned.
> > >
> > > Note that what we are attempting to do is enforce AD's security with
> > > the search results we present.  There is no need to define a whole
> > > new security mechanism, because AD already has one that people use.
> > >
> > > >>>>>>
> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > > to ensure there are no security or other dependencies of indexed
> > > data with any external repository - most notably the file system.
> > > There are many reasons for wanting this, but one of the main ones is
> > > that Solr-stored data is not always based on file data (or
> > > accessible
> > file data).
> > > In fact, in my particular case, almost none of the indexed data
> > > comes from files.
> > > <<<<<<
> > >
> > > LCF is all about abstracting from repositories.  It's not
> > > specifically about a file system, although that is a convenient
> > > example.  If you are building your own kind of repository with your
> > > own security setup, that's fine - but in the LCF world you'd need to
> > > create an authority connector for your repository (which maybe reads
> > > your acl.xml file), as well as a repository connector (which hands
> > > documents to LCF and provides it with the access tokens that make
> > > > security work).  Of course, you can do something much lighter that
> > > doesn't include LCF at all if you are just integrating a custom
> > > repository of your own, but it sounded like you were interested in the
> broader problem here.
> > >
> > > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > > connectors to work cooperatively to define access tokens in a way
> > > that is consistent from authority connector to repository connector
> > > for a given repository kind.  Anybody can write a connector, so the
> > > beauty of all this is that you can build a system where data from
> > > many disparate sources is indexed, and security for each is
> > > simultaneously
> > enforced.
> > >
> > > Karl
> > >
> > >
> > >  ------------------------------
> > > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > > *Sent:* Thursday, April 22, 2010 9:24 AM
> > >
> > > *To:* dev@lucene.apache.org
> > > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > > connectors-dev@incubator.apache.org
> > > *Subject:* Re: FW: Solr and LCF security at query time
> > >
> > > Hi Karl,
> > >
> > > Thanks very much for the diagram -
> > > Sorry about all the questions, but this raises a few new ones...
> > >
> > > What is the relationship between stored data (documents) and
> authorities'
> > > access/deny attributes? (do you have any examples of what an
> > > access_token value might contain?)
> > >
> > > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > > to ensure there are no security or other dependencies of indexed
> > > data with any external repository - most notably the file system.
> > > There are many reasons for wanting this, but one of the main ones is
> > > that Solr-stored data is not always based on file data (or
> > > accessible
> > file data).
> > > In fact, in my particular case, almost none of the indexed data
> > > comes from files.
> > >
> > > This is one reason why SOLR-1872 uses filter queries for its
> > > access/deny tokens - so that all the required information for access
> > > control completely resides within the Solr index itself.
> > > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > > and users, some external 'repository' (files) and users, or
> > > arbitrary
> > data (e.g.
> > > either of these)?
> > >
> > > I hope that makes sense...
> > >
> > > Thanks!
> > > Peter
> > >
> > >
> > >
> > >
> > > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> > >
> > >> Hi Peter,
> > >>
> > >> I've attached a diagram that is not in the wiki as of yet, and I'll
> > >> try to answer your questions.
> > >>
> > >> >>>>>>
> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> > >> stored for a particular user in the underlying acl store (e.g.
> > >> Active
> > Directory)?
> > >> How does AD and/or LCF handle storing such data in its schema?
> > > >> (does AD need its schema extended?) Presumably, any such AD fields
> > >> would need to be queried for effective rights in order to cater for
> > >> group membership allows and denies.
> > >> <<<<<<
> > >>
> > >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> > >> strings that represent a contract between an LCF authority
> > >> connection and the LCF repository connection that picks up the
> > >> documents (from
> > wherever).
> > >>  These tokens thus have no real meaning outside of LCF.  You must
> > >> regard them as opaque.
> > >>
> > >> The contract, however, states that if you use the LCF authority
> > >> service to obtain tokens for an authenticated user, you will get
> > >> back a set that is CONSISTENT with the tokens that were attached to
> > >> the documents LCF sent to Solr for indexing in the first place.
> > >> So, you don't have to worry about it, and that's kind of the idea.
> > >> So you
> > imagine the following flow:
> > >>
> > >> (1) Use LCF to fetch documents and send them to Solr
> > >> (2) When searching, use the LCF authority service to get the
> > >> desired user's access tokens
> > >> (3) Either filter the results, or modify the query, to be sure the
> > >> access tokens all match up properly
> > >>
> > >> For the AD authority, the LCF access tokens consist, in part, of
> > >> the user's SIDs.  For other authorities, the access tokens are
> > >> wildly
> > different.
> > >>  You really don't want to know what's in them, since that's the job
> > >> of the LCF authority to determine. ;-)
> > >>
> > >> LCF is not, by the way, joined at the hip with AD.  However, in
> > >> practice, most enterprises in the world use some form of AD single
> > >> signon for their web applications, and even if they're using some
> > >> repository with its own idea of security, there's a mapping between
> > >> the AD users and the repository's users.  Doing that mapping is
> > >> also the job of the LCF authority for that repository.
> > >>
> > >> Hope this helps.  Also, I'm not expecting time miracles here, so
> > >> don't sweat the schedule.
> > >>
> > >>
> > >> Karl
> > >>
> > >>
> > >> ________________________________________
> > >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> > >> Sent: Thursday, April 22, 2010 4:27 AM
> > >> To: dev@lucene.apache.org
> > >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > >> connectors-dev@incubator.apache.org
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> Hi Karl,
> > >>
> > >> Thanks for the quick turnaround.
> > >> I'm in the middle of a product release for us, so I fear I won't be
> > >> as quick as you... :-)
> > >>
> > >> I couldn't find a simple flow diagram or similar for LCF with
> > >> regards security (probably looking in the wrong place).
> > >> Perhaps you could help on these questions...?
> > >>
> > >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> > >> sub-queries, which are then used as filter queries in a user's search.
> > >>
> > >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> > >> stored for a particular user in the underlying acl store (e.g.
> > >> Active
> > Directory)?
> > >> How does AD and/or LCF handle storing such data in its schema?
> > > >> (does AD need its schema extended?) Presumably, any such AD fields
> > >> would need to be queried for effective rights in order to cater for
> > >> group membership allows and denies.
> > >>
> > >> I guess I'm just trying to understand the architectural
> > >> flow/storage/retrieval of data in the various parts of the system,
> > >> but I admit, I need to do more research on this.
> > >> After our product release, when I get a few more spare cycles, I
> > >> can look at it in more detail.
> > >>
> > >> Many thanks!
> > >> Peter
> > >>
> > >>
> > >>
> > >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
> > >> karl.wright@nokia.com>> wrote:
> > >> Hi Peter,
> > >>
> > >> I just committed the promised changes to the LCF Solr output
> connector.
> > >>
> > >> ACL metadata will now be posted to the Solr Http interface along
> > >> with the document as the two following fields:
> > >>
> > >> __ACCESS_TOKEN__document
> > >> __DENY_TOKEN__document
> > >>
> > >> There will, of course, potentially be multiple values for each of
> > >> these two fields.
> > >>
> > >> Hope this helps,
> > >> Karl
> > >>
> > >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> > >> peter.sturge@googlemail.com>]
> > >> Sent: Tuesday, April 20, 2010 6:51 PM
> > >>
> > >> To: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> Hi Karl,
> > >>
> > >> Thanks for the info. I'll have a look at the link and try to take
> > >> in as much sugar as my insulin levels will handle...
> > >> It sounds like the necessary interface(s) are already in LCF - just
> > >> a matter of implementing them in the Solr 1872 plugin.
> > >> I'll need to digest the LCF stuff to get to grips with it..please
> > >> bear with me while I do that...
> > >>
> > >> When you say:
> > >>   The LCF solr output connection doesn't yet do this, but it is
> > >> trivial for me to make that happen.
> > >> Do you mean a mechanism by which solr.war can get url et al info
> > >> from its parent container (Tomcat, Jetty etc.), or have I
> > >> misinterpreted
> > this?
> > >>
> > >>
> > >> Thanks,
> > >> Peter
> > >>
> > >>
> > >>
> > >>
> > >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
> > >> karl.wright@nokia.com>> wrote:
> > >> Hi Peter,
> > >>
> > >> I'm the principal committer for LCF, but I don't know as much about
> > >> Solr as I ought to, so it sounds like a potentially productive
> > collaboration.
> > >>
> > >> LCF does exactly what you are looking for - the only issue at all
> > >> is that you need to fetch a URL from a webapp to get what you are
> > >> looking for.  The "plugs" are all inside LCF for different kinds of
> > >> repositories.  Here's a link that might help with drinking the LCF
> > "koolaid", as it were:
> > > >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> > >>
> > >> The url would be something like this (on a locally installed
> > >> tomcat-based LCF instance):
> > >>
> > >>
> > > >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> > >>
> > >> ... and this fetch returns something like:
> > >>
> > >> TOKEN:xxxxxxx
> > >> TOKEN:yyyyyyy
> > >> TOKEN:zzzzzzz
> > >> ....
> > >>
> > >> ... which represent the amalgamated tokens for all of the defined
> > >> authorities, and by some strange coincidence ( ;-) ) are compatible
> > >> with certain pieces of metadata that have been passed into Solr
> > >> with each document - one set of Allow tokens, and a second set of
> > >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> > >> but it is trivial for me to make that happen.
> > >>
> > >> Does this sound plausible to you?
> > >>
> > >> Karl
> > >>
> > >>
> > >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> > >> peter.sturge@googlemail.com>]
> > >> Sent: Tuesday, April 20, 2010 5:41 PM
> > >> To: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
> > >> dev@lucene.apache.org>
> > >>
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> Hi Karl,
> > >>
> > >> Integrating LCF to get external token support for SOLR-1872 sounds
> > >> very interesting indeed. I don't know anything about LCF, but one
> > >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> > >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> > >> series of plugins that could be used for obtaining back-end
> > >> authentication
> > information.
> > >>
> > >> If you're good with LCF, perhaps we could work together to build
> > >> this
> > in.
> > >> One of the first things would be defining an interface that would
> > >> be as easy as possible to plug LCF into. Have you any
> > >> suggestions/insight on this front?
> > >>
> > >> Many thanks,
> > >> Peter
> > >>
> > >>
> > >>
> > >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
> > >> karl.wright@nokia.com>> wrote:
> > >> SOLR-1872 looks exactly like what I was envisioning, from the
> > >> search query perspective, although instead of the acl xml file you
> > >> specify LCF stipulates you would dynamically query the
> > >> lcf-authority-service servlet for the access tokens themselves.
> > >> That would get you support for AD, Documentum, LiveLink, Meridio,
> > >> and Memex for free. It seems likely that this component could be
> > >> modified to work with LCF with minor
> > effort.
> > >>
> > >> The missing component still seems to be AD authentication, which
> > >> needs a solution.
> > >>
> > >> Karl
> > >>
> > >> ________________________________
> > >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> > >> peter.sturge@googlemail.com>]
> > >> Sent: Tuesday, April 20, 2010 10:44 AM
> > >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
> > >> Subject: Re: FW: Solr and LCF security at query time
> > >>
> > >> If you want to do this completely within Solr, have a look at:
> > >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> > >>
> > >> Thanks,
> > >> Peter
> > >>
> > >>
> > >>
> > >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
> > >> karl.wright@nokia.com>> wrote:
> > >> FYI
> > >>
> > >> ________________________________
> > >> From: Wright Karl (Nokia-S/Cambridge)
> > >> Sent: Tuesday, April 20, 2010 8:16 AM
> > >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
> > >> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
> > >> connectors-dev@incubator.apache.org<mailto:
> > >> connectors-dev@incubator.apache.org>'; '
> > >> connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>'
> > >> Subject: RE: Solr and LCF security at query time
> > >>
> > >> Dominique,
> > >>
> > >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> > >> establishes a powerful multi-repository security model, even though
> > >> it doesn't yet do the final step of enforcing that model at the
> > >> search end.  LCF allows you to define multiple authorities to
> > >> operate against disparate repositories, and use the appropriate
> > >> authority to secure any given document.  The solr people are aware
> > >> of this design, which addresses the issues raised by SOLR-1834 very
> > >> nicely.  However, as I said before, time is a problem, and the work
> > >> still needs to be
> > done.
> > >>
> > >> I suggest you read up on the actual security model of LCF, and
> > >> perhaps experiment with that and the SOLR-1834 contribution, to see
> > >> if there is common ground.  One thing we've learned at MetaCarta is
> > >> that post-filtering for security purposes is expensive, and it is
> > >> better to modify the queries themselves to restrict the results, if
> > >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> > >> sounds like it might be the filtering approach.  Still, it would be
> > better than nothing.
> > >>
> > >> Please let me know what you find out.
> > >>
> > >> Thanks,
> > >> Karl
> > >>
> > >> ________________________________
> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
> > >> dominique.bejean@eolya.fr>]
> > >> Sent: Tuesday, April 20, 2010 8:03 AM
> > >> To: Wright Karl (Nokia-S/Cambridge)
> > >> Cc: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>;
> > >> connectors-dev@incubator.apache.org<mailto:
> > >> connectors-dev@incubator.apache.org>
> > >> Subject: Re: Solr and LCF security at query time
> > >>
> > >> Karl,
> > >>
> > >> Thank you for your reply.
> > >>
> > > >> I did some research today and I found this:
> > > >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> > > >> http://demo.findwise.se:8880/SolrSecurity/
> > > >>
> > > >> The Solr security model has to be able to filter a result list with
> > > >> items coming from various sources at the same time (livelink,
> > > >> documentum, file system, ...). Big subject :)
> > >>
> > >> Dominique
> > >>
> > >>
> > > >> On 20/04/10 13:34,
> > > >> karl.wright@nokia.com<ma...@nokia.com> wrote:
> > >> Hi Dominique,
> > >>
> > >> At the moment, in order to enforce the LCF security model within
> > >> Lucene/Solr, you will need to build this functionality into
> > >> whatever client you are using to display the Lucene search results.
> > >> Specifically, you would need to take the following steps:
> > >>
> > >> (1) Have your users access your search client through Apache.
> > >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> > >> mod_authz_annotate, to cause authorization HTTP headers to be
> > >> transmitted to the client webapp.
> > >> (3) Have your client webapp alter whatever queries it is doing, to
> > >> add an appropriate query clause for each of the access tokens
> > >> transmitted in the headers.
> > >>
> > >> (This is how it is done at MetaCarta.)
> > >>
> > >> Alternatively, you may find a way to do this completely with a web
> > >> application under a Java app server such as Tomcat.  I have not yet
> > >> done the research to find out whether this is a feasible alternative.
> > >> Effectively, what you need something like mod_auth_kerb to do is to
> > >> authenticate your user against Active Directory, or whomever the
> > authenticator ought to be.
> > >>  JAAS may be helpful here.
> > >>
> > >> There are, of course, intentions to fill out the missing pieces
> > >> more completely and transparently via a Solr search plugin and/or
> filter.
> > >> What has been lacking is time.  If you are in a position to do
> > >> development in this area, we're happy to have any assistance you
> > >> might
> > provide.
> > >>
> > >> Thanks,
> > >> Karl
> > >> ________________________________
> > >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> > >> Sent: Tuesday, April 20, 2010 5:06 AM
> > >> To: connectors-user@incubator.apache.org<mailto:
> > >> connectors-user@incubator.apache.org>
> > >>  Subject: Solr and LCF security at query time
> > >>
> > >> Hi,
> > >>
> > > >> I don't see in the LCF wiki how Solr and LCF work together at query
> > >> time in order to remove from the result list the items the user is
> > >> not allowed to access.
> > >>
> > > >> In
> > > >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> > > >> I just see these sentences:
> > >>
> > >> " Once all these documents and their access tokens are handed to
> > >> the search engine, it is the search engine's job to enforce
> > >> security by excluding inappropriate documents from the search
> > >> results. For Lucene, this infrastructure is expected to be built on
> > >> top of Lucene's generic metadata abilities, but has not been
> > >> implemented at
> > this time."
> > >>
> > > >> I am not sure I understand. Does this mean that for the moment, it
> > >> is not possible for Solr to apply security by using an Authority
> > Connector ?
> > >>
> > >> Dominique
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -------------------------------------------------------------------
> > >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
> > >> additional commands, e-mail: dev-help@lucene.apache.org
> > >>
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
> > additional commands, e-mail: dev-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Ok, not hearing back from Peter, I've done some Solr research and written some code that might work.  The approach I've taken is most similar to SOLR-1834, other than the LCF-centric logic.  Hopefully there will be a chance to try this out end-to-end on the weekend, after which I will submit it to the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define the __ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be multivalued string fields (I think).  It would be great if these could be made a default part of Solr; similarly, it would be good if the new search component were delivered with Solr and mentioned (even if commented out) in the example solrconfig.xml file.  The only other thing needed to hook up the search component is a configuration parameter giving the base URL of the LCF authority service.  Also, as I said earlier, we don't yet have a canned solution for authentication, although I feel that will be straightforward.
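
To make the moving parts concrete, here is a rough sketch (not the actual search component) of how a client could turn the authority service's answer into a filter on the document-level fields named above.  It assumes the service lives at <base URL>/UserACLs?username=... and answers with one "TOKEN:..." line per access token, as described earlier in this thread; the helper names are made up for illustration, and the share-level fields would need analogous handling that is not shown.

```python
from urllib.parse import urlencode

# Field names as given in this message; only the document-level pair is handled.
ALLOW_FIELD = "__ALLOW_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def authority_url(base_url, username):
    """Build the lookup URL (hypothetical helper; path taken from this thread)."""
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(response_text):
    """Extract tokens from a response made of 'TOKEN:<value>' lines (assumed format)."""
    tokens = []
    for line in response_text.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


def quote(token):
    """Quote a token so characters such as ':' are not parsed as query syntax."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filter_query(tokens):
    """Require a match on some ALLOW token and forbid a match on any DENY token."""
    if not tokens:
        # No tokens means the user can see nothing; the caller should refuse the search.
        raise ValueError("no access tokens for user")
    group = "(" + " OR ".join(quote(t) for t in tokens) + ")"
    return "+" + ALLOW_FIELD + ":" + group + " -" + DENY_FIELD + ":" + group
```

The resulting string would be added to the request as a filter query.  Note that, written this way, a document indexed with no ALLOW tokens at all is never returned; whether that is the right default is an open design question.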

Comments welcome...
Karl


________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR-1872 and SOLR-1834 contributions in detail, and have a couple of Solr-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
> The reason is that it is almost never the case that somebody wants to
> provide multiple credentials in order to be able to see their results.
> Most enterprises that have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you read my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> Sun's Krb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real-world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Erik Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some [group of] documents? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). Scenarios where user/group access control changes multiple times in one day are not uncommon.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there's two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible
> alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems.
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, PLUS the intersection of his tokens with the
> > document ALLOW tokens, MINUS the intersection of his tokens with the
> > document DENY tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
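
The matching rule in the example above can be sketched in a few lines of Python. This is only an illustration of the rule as described (at least one ALLOW token in common, no DENY token in common); the function name `is_visible` is made up for this sketch and is not part of LCF or Solr.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if the user may see a document.

    Assumed rule: the user must hold at least one of the document's ALLOW
    tokens and none of its DENY tokens.
    """
    user = set(user_tokens)
    return bool(user & set(allow_tokens)) and not (user & set(deny_tokens))


# Tokens from the example; the authority name "myAD" prefixes each SID.
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]
directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
directory_b = ["myAD:S-123-456-76890"]

print(is_visible(peter, directory_a))  # Peter is in Group2, which can read DirectoryA
print(is_visible(peter, directory_b))  # only Group1 can read DirectoryB
```

In a real deployment this comparison would be done by the search engine through query clauses rather than per document in client code.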
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).  These tokens thus have no real meaning
> >> outside of LCF.  You must regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you can imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.  You really don't want to know what's in them,
> >> since that's the job of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >>
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
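
A rough sketch of how a client might build that URL and read the reply follows. Only the URL shape and the `TOKEN:` line prefix come from the description above; the helper names are hypothetical, and any other line types the service may return are simply ignored here.

```python
from urllib.parse import urlencode


def user_acls_url(base_url, username):
    """Build the UserACLs URL for a user; base_url is the authority service root."""
    return f"{base_url.rstrip('/')}/UserACLs?{urlencode({'username': username})}"


def parse_tokens(body):
    """Extract token values from a response made of 'TOKEN:<value>' lines."""
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


url = user_acls_url("http://localhost:8080/lcf-authority-service",
                    "someusername@somedomain.com")
print(url)
print(parse_tokens("TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"))
```

The actual fetch (for example with `urllib.request.urlopen(url)`) is left out so the sketch stays runnable without a live LCF instance.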
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates that you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org';
> >> 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter the result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com<ma...@nokia.com> wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whoever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> "Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org




RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Ok, not hearing back from Peter, I've done some Solr research and written some code that might work.  The approach I've taken is most similar to SOLR-1834, other than the LCF-centric logic.  Hopefully there will be a chance to try this out in a full end-to-end way on the weekend, after which I will submit it to the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define __ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be string, multivalued fields (I think).  It would be great if these could be made a default part of Solr; similarly, it would be good if the new search component was predelivered with Solr and mentioned (even if commented out) in the example solrconfig.xml file.  The only other thing that needs to be done to hook up the search component is to include a configuration parameter describing the base URL of the LCF authority service.  Plus, as I said earlier, we still don't have a canned solution for authentication yet - although I feel that will be straightforward.
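
As a rough illustration of what such a search component would add to a query, here is a sketch that turns a user's tokens into Solr filter-query strings over the document-level fields named above. It is not the code being described; the function names are made up, the share-level fields are left out because how they combine is not spelled out here, and the rule assumed is "match at least one ALLOW token, match no DENY token".

```python
def quote_term(token):
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def security_filter_queries(tokens,
                            allow_field="__ALLOW_TOKEN__document",
                            deny_field="__DENY_TOKEN__document"):
    """Return two fq strings: one requiring an ALLOW match, one excluding DENY matches."""
    if not tokens:
        # With no tokens the user should see nothing; handle that upstream.
        raise ValueError("user has no access tokens")
    terms = " OR ".join(quote_term(t) for t in tokens)
    return [f"{allow_field}:({terms})", f"-{deny_field}:({terms})"]


for fq in security_filter_queries(["myAD:S-1-1-0", "myAD:S-23-64-12345"]):
    print(fq)
```

Each returned string would be passed as a separate `fq` parameter, so both restrictions apply to the same result set.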

Comments welcome...
Karl


________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR-1872 and SOLR-1834 contributions in detail, and have a couple of SOLR-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
> The reason for this is that it is almost never the case that
> somebody wants to provide multiple credentials in order to be able to see their results.
> Most enterprises who have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you noted my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> Sun's kerb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Erik Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). It is not uncommon for user/group access control to change multiple times in one day.
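
To illustrate the alternative being argued for, here is a minimal sketch of query-time access control in which allow/deny sub-queries live in a small table outside the index. The table layout, field names and function name are hypothetical stand-ins (this is not the acl.xml format); the point is only that changing a user's access edits the table and touches no indexed document.

```python
# Hypothetical stand-in for an acl.xml-style store: user -> filter sub-queries.
ACL = {
    "peter": {"allow": ["source:network"], "deny": ["level:restricted"]},
}


def filter_queries_for(user, acl=ACL):
    """Return the fq clauses for a user; users without an entry are rejected."""
    entry = acl.get(user)
    if entry is None:
        raise PermissionError(f"no ACL entry for {user!r}")
    # Each clause is sent as its own fq, so all of them must hold.
    return list(entry["allow"]) + [f"-({q})" for q in entry["deny"]]


before = filter_queries_for("peter")
ACL["peter"]["allow"] = ["source:public"]  # access changed; nothing is re-indexed
after = filter_queries_for("peter")
print(before)
print(after)
```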

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped ] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, PLUS the intersection of his tokens with the
> > document ALLOW tokens, MINUS the intersection of his tokens with the
> > document DENY tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).
> >>  These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.
> >>  You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
> >> karl.wright@nokia.com>> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> >> peter.sturge@googlemail.com>]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >>
> >> To: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
> >> karl.wright@nokia.com>> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> >> peter.sturge@googlemail.com>]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
> >> dev@lucene.apache.org>
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
> >> karl.wright@nokia.com>> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> >> peter.sturge@googlemail.com>]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
> >> karl.wright@nokia.com>> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
> >> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
> >> connectors-dev@incubator.apache.org<mailto:
> >> connectors-dev@incubator.apache.org>'; '
> >> connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
> >> dominique.bejean@eolya.fr>]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>;
> >> connectors-dev@incubator.apache.org<mailto:
> >> connectors-dev@incubator.apache.org>
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and I found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34,
> >> karl.wright@nokia.com<ma...@nokia.com> wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> >> authenticator ought to be.
> >>  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>
> >>  Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences :
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector ?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Ok, not hearing back from Peter, I've done some Solr research and written some code that might work.  The approach I've taken is most similar to SOLR-1834, other than the LCF-centric logic.  Hopefully there will be a chance to try this out in a full end-to-end way on the weekend, after which I will submit it to the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define __ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and __DENY_TOKEN__share fields.  These should be string, multivalued fields (I think).  It would be great if these could be made a default part of Solr; similarly, it would be good if the new search component was predelivered with Solr and mentioned (even if commented out) in the example solrconfig.xml file.  The only other thing that needs to be done to hook up the search component is to include a configuration parameter describing the base URL of the LCF authority service.  Plus, as I said earlier, we still don't have a canned solution for authentication yet - although I feel that will be straightforward.
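
To make the moving parts concrete, here is a minimal sketch in Python of the query-time logic described in this thread. It is not the search component itself (that would be Java inside Solr); it only illustrates the idea under stated assumptions. The function names (authority_url, parse_tokens, build_filter_queries, is_visible) are hypothetical, the "TOKEN:" response format and the UserACLs URL are taken from the examples quoted in this thread, the share-level fields are ignored, and the exact filter syntax a real component would emit may well differ.

```python
from urllib.parse import urlencode

# Field names as given in the message above; share-level fields are ignored here.
ALLOW_FIELD = "__ALLOW_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def authority_url(base_url, username):
    """Build the lookup URL, following the 'UserACLs?username=' form quoted in this thread."""
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(response_text):
    """Collect the values of 'TOKEN:<value>' lines; any other line is skipped."""
    tokens = []
    for line in response_text.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


def quote(token):
    """Wrap a token in double quotes, escaping backslashes and embedded quotes."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filter_queries(tokens):
    """Return two filter clauses: match at least one allow token, and no deny token."""
    if not tokens:
        # Assumption: a user with no tokens should see nothing; this clause is
        # intended to match no documents.
        return ["-*:*"]
    joined = " OR ".join(quote(t) for t in tokens)
    return [f"{ALLOW_FIELD}:({joined})", f"-{DENY_FIELD}:({joined})"]


def is_visible(user_tokens, allow_tokens, deny_tokens):
    """Reference check of the same rule for a single document, without Solr."""
    user = set(user_tokens)
    return bool(user & set(allow_tokens)) and not (user & set(deny_tokens))


if __name__ == "__main__":
    response = (
        "TOKEN:myAD:S-1-1-0\n"
        "TOKEN:myAD:S-323-999-12345\n"
        "TOKEN:myAD:S-23-64-12345\n"
    )
    for clause in build_filter_queries(parse_tokens(response)):
        print(clause)
```

Restricting the query with filter clauses, rather than post-filtering the result list, follows the earlier observation in this thread that post-filtering for security is expensive.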

Comments welcome...
Karl


________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org; dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR-1872 and SOLR-1834 contributions in detail, and have a couple of Solr-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
>  The reason for this is that it is almost never the case that
> somebody wants to provide multiple credentials in order to be able to see their results.
>  Most enterprises who have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you read my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> Sun's Krb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Erik Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here's just a few:
  * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). It is not uncommon for user/group access control to change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped ] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there's two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible
> alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, PLUS the intersection of his tokens with the
> > document ALLOW tokens, MINUS the intersection of his tokens with the
> > document DENY tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
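
The rule described above (at least one matching ALLOW token, and no matching DENY token) can be sketched in a few lines of Python. This is only an illustration of the set logic using the SIDs from the example; the function name `is_visible` is hypothetical and is not part of LCF or Solr.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """Return True when the user holds at least one ALLOW token and no DENY token.

    Hypothetical helper, written only to illustrate the rule in the message.
    """
    user = set(user_tokens)
    if user & set(deny_tokens):
        # Any overlap with the document's DENY tokens hides the document.
        return False
    # Otherwise the document is visible only if an ALLOW token matches.
    return bool(user & set(allow_tokens))


# Token values taken from the example in this message.
peter = {"myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"}
directory_a = {"myAD:S-123-456-76890", "myAD:S-23-64-12345"}
directory_b = {"myAD:S-123-456-76890"}

print(is_visible(peter, directory_a))  # True: Peter holds the Group2 token
print(is_visible(peter, directory_b))  # False: only Group1 may read DirectoryB
```

In a real deployment this test would be expressed as part of the search query rather than applied document by document, but the outcome should be the same.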
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).
> >>  These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.
> >>  You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >>
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
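
As a rough illustration of the fetch described above, the following Python sketch builds the UserACLs URL and parses a response body of `TOKEN:` lines. The helper names `user_acls_url` and `parse_tokens` are hypothetical, and the parser assumes (beyond what the message states) that any line not starting with `TOKEN:` can be ignored. No network call is made; the sample body is the placeholder response shown in the message.

```python
from urllib.parse import urlencode


def user_acls_url(base_url, username):
    """Build the authority-service URL for one user (hypothetical helper)."""
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract access tokens from a response body with one 'TOKEN:...' per line.

    Assumption: lines without the 'TOKEN:' prefix are skipped.
    """
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


sample_body = "TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"
print(user_acls_url("http://localhost:8080/lcf-authority-service",
                    "someusername@somedomain.com"))
print(parse_tokens(sample_body))
```

Note that `urlencode` percent-encodes the `@` in the user name, which is the safe form for a query string.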
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and I found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> -------------------------------------------------------------------
> >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
> >> additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>



RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions in detail, and have a couple of SOLR-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?
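
To make the kind of clause under discussion concrete, here is a minimal Python sketch that turns a user's access tokens into a single restriction string in Lucene query syntax, of the sort a client might pass in Solr's `fq` parameter. It is not taken from SOLR-1834 or SOLR-1872. The function names are hypothetical, and the default field names are the ones mentioned for the LCF Solr output connector elsewhere in this thread; whether a given schema actually uses those names is an assumption.

```python
def quote_term(token):
    """Wrap a token in double quotes, escaping backslashes and embedded quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_security_filter(user_tokens,
                          allow_field="__ACCESS_TOKEN__document",
                          deny_field="__DENY_TOKEN__document"):
    """Build '+(allow clauses) -(deny clauses)' from the user's tokens.

    Hypothetical helper: a document must carry at least one of the tokens in
    the allow field (MUST) and none of them in the deny field (MUST_NOT).
    """
    tokens = sorted(set(user_tokens))
    if not tokens:
        # With no tokens there is nothing to match; leave the policy to the caller.
        raise ValueError("no access tokens supplied")
    allow = " OR ".join(f"{allow_field}:{quote_term(t)}" for t in tokens)
    deny = " OR ".join(f"{deny_field}:{quote_term(t)}" for t in tokens)
    return f"+({allow}) -({deny})"


print(build_security_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"]))
```

The required and prohibited groups here correspond to the MUST and MUST_NOT style; the tokens inside each group are alternatives, so OR is used within a group.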

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
> The reason for this is that it is almost never the case that
> somebody wants to provide multiple credentials in order to be able to see their results.
>  Most enterprises who have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you noted my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> sun's kerb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Erik Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some [group of] documents? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). It is not uncommon for user/group access control to change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there's two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems.
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, PLUS the intersection of his tokens with the
> > document ALLOW tokens, MINUS the intersection of his tokens with the
> > document DENY tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
> >
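The rule described above can be sketched in a few lines of Python, using the tokens from this example. This is a minimal illustration under the assumption that a document is visible when it shares at least one ALLOW token with the user and shares no DENY token; the `visible` function is a hypothetical name, not part of LCF or Solr.

```python
def visible(user_tokens, allow_tokens, deny_tokens=()):
    """True if the user holds an ALLOW token of the document and no DENY token."""
    user = set(user_tokens)
    return bool(user & set(allow_tokens)) and not (user & set(deny_tokens))

# Tokens the authority service returns for Peter in the example.
peter = {"myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"}

# ALLOW tokens attached to the files in each directory.
directory_a = {"myAD:S-123-456-76890", "myAD:S-23-64-12345"}
directory_b = {"myAD:S-123-456-76890"}

print(visible(peter, directory_a))  # True: Group2's SID is an ALLOW token
print(visible(peter, directory_b))  # False: only Group1 may read DirectoryB
```

In a real deployment this check would be folded into the search query itself rather than run per document, for the cost reasons discussed elsewhere in this thread.
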
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).
> >>  These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
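Step (3), in its query-modification form, could look roughly like the Python sketch below. It builds one filter clause in Lucene query syntax from a user's tokens. The field names are the ones mentioned in this thread for the LCF Solr output connector; `quote_token` and `build_filter` are hypothetical helper names, and the exact clause shape is an assumption rather than what either SOLR-1834 or SOLR-1872 does.

```python
def quote_token(token):
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'

def build_filter(tokens,
                 allow_field="__ACCESS_TOKEN__document",
                 deny_field="__DENY_TOKEN__document"):
    """Require an allow-token match and prohibit any deny-token match."""
    if not tokens:
        # Refuse to build an unrestricted filter for a user with no tokens.
        raise ValueError("no access tokens supplied")
    joined = " OR ".join(quote_token(t) for t in sorted(tokens))
    return "+{a}:({j}) -{d}:({j})".format(a=allow_field, d=deny_field, j=joined)

print(build_filter({"myAD:S-1-1-0", "myAD:S-23-64-12345"}))
```

The resulting string would be passed as an additional filter on the user's search, so that restricted documents never reach the result set in the first place.
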
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.
> >>  You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD needs its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
> >>
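A client of the authority service might handle that exchange along the lines of the Python sketch below. The base URL and the "TOKEN:" line format come from the example above; the helper names `authority_url` and `parse_tokens` are hypothetical, and skipping lines that do not start with "TOKEN:" is an assumption about how other response lines should be treated. The HTTP fetch itself is left out so the sketch stays self-contained.

```python
from urllib.parse import urlencode

# Base URL from the example in this thread (a local Tomcat-based install).
AUTHORITY_URL = "http://localhost:8080/lcf-authority-service/UserACLs"

def authority_url(username, base=AUTHORITY_URL):
    """Build the lookup URL, percent-encoding the user name."""
    return base + "?" + urlencode({"username": username})

def parse_tokens(body):
    """Extract the token values from a 'TOKEN:<value>' per-line response."""
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens

print(authority_url("someusername@somedomain.com"))
print(parse_tokens("TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"))
```

The parsed tokens are the values to compare against the allow and deny metadata stored with each document.
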
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates that you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and I found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
>


RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

>>>>>>
  * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
<<<<<<

In LCF, you change the permissions on the appropriate resource, and then you run your LCF job again to update those permissions.  Since LCF is an incremental crawler, it is smart enough to only reindex those documents whose permissions have changed, which makes it a fairly fast operation on most repositories.  Also, in my experience at MetaCarta, this is a relatively infrequent kind of situation, and most enterprises are pretty resilient against there being a reasonable delay in getting document permissions updated in an index.
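
To make the idea of "only reindex those documents whose permissions have changed" concrete, here is a small Python sketch of one way a crawler could detect such changes: keep a fingerprint of each document's allow/deny tokens from the previous crawl and compare it with the current one. This is purely illustrative; the names `acl_fingerprint` and `changed_documents` are hypothetical, and nothing here is meant to describe how LCF actually tracks document versions.

```python
import hashlib

def acl_fingerprint(allow, deny):
    """Stable hash of a document's allow and deny tokens (order-independent)."""
    payload = "\n".join(sorted(allow)) + "\n--\n" + "\n".join(sorted(deny))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def changed_documents(previous, current):
    """Return ids whose ACL fingerprint differs from the last crawl.

    previous: {doc_id: fingerprint} recorded by the last crawl
    current:  {doc_id: (allow_tokens, deny_tokens)} seen by this crawl
    """
    return sorted(
        doc_id
        for doc_id, (allow, deny) in current.items()
        if previous.get(doc_id) != acl_fingerprint(allow, deny)
    )

previous = {
    "a": acl_fingerprint({"t1"}, set()),
    "b": acl_fingerprint({"t1"}, set()),
}
current = {
    "a": ({"t1"}, set()),        # unchanged
    "b": ({"t1", "t2"}, set()),  # permissions changed
    "c": ({"t3"}, set()),        # new document
}
print(changed_documents(previous, current))  # ['b', 'c']
```

Only the documents in the returned list would need to be sent to the index again, which is what keeps a permissions-only update cheap.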

However, if this is a concern in your environment, your main alternative is to go directly to the repository for every document as you filter a result set.  That's slow in most situations, though perhaps not for a local acl.xml file.  Caching might improve performance, but only if you knew that the same results would be returned for multiple queries.

>>>>>>
  * What happens if you need to revoke or change a user's or group's access?
<<<<<<

I presume you mean a user/group's access to specific documents - which has the same answer as above.  If you actually mean the more typical case, where a user is locked out, or loses/gains group access, that of course happens at authority time, so it is instantaneous.
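
Because the user's tokens are looked up at authority time, a search front end only has to fetch and parse them per request. A minimal sketch, assuming the `lcf-authority-service/UserACLs` URL and the `TOKEN:` line format quoted further down this thread; the function names, the UTF-8 decoding, and the decision to ignore any non-`TOKEN:` lines are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Example URL from the thread, for a locally installed Tomcat-based LCF instance.
DEFAULT_SERVICE = "http://localhost:8080/lcf-authority-service/UserACLs"


def parse_tokens(body):
    """Extract token values from 'TOKEN:<value>' lines; other lines are ignored."""
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


def fetch_user_tokens(username, service_url=DEFAULT_SERVICE):
    """Ask the authority service for the tokens of an already authenticated user."""
    url = service_url + "?" + urlencode({"username": username})
    with urlopen(url) as response:
        return parse_tokens(response.read().decode("utf-8"))


# Parsing a canned response; no network access is needed for this part.
sample = "TOKEN:myAD:S-1-1-0\nTOKEN:myAD:S-323-999-12345\nTOKEN:myAD:S-23-64-12345\n"
sample_tokens = parse_tokens(sample)
```

Since nothing about the user is stored in the index, a lockout or group change shows up in the very next call to `fetch_user_tokens`.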

>>>>>>
  * It's difficult to move/replicate the index to another domain
<<<<<<

Sure.  If this is something you intend to do a lot, this is not a solution that will work for you.  I don't think we ever had any of MetaCarta's clients even *think* of doing this, however. ;-)  Probably because lots of other stuff breaks as well.

>>>>>>
  * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)
<<<<<<

Any infrequent operation is not much of a concern to me, since LCF keeps track of any changes and will pick them up on the next crawl (and do the minimum possible to update the index, as well).

Thanks,
Karl


-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
>  The reason for this is because it is almost never the case that
> somebody wants to provide multiple credentials in order to be able to see their results.
>  Most enterprises who have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you noted my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> Sun's Krb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Eric Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). There are not uncommon scenarios where user/group access control can change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped ] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).
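
For reference, the index-time approach under discussion amounts to storing allow/deny tokens on each document and restricting every query by the user's tokens. A minimal sketch of that rule, using the DirectoryA/DirectoryB example from the quoted thread below. The field names are the ones mentioned in the thread; `can_see`, `build_filter_query`, and the exact filter-query layout are assumptions for illustration, not the code of SOLR-1872 or LCF.

```python
def can_see(user_tokens, allow_tokens, deny_tokens=()):
    """A document is visible if the user holds at least one of its ALLOW
    tokens and none of its DENY tokens."""
    user = set(user_tokens)
    return bool(user & set(allow_tokens)) and not (user & set(deny_tokens))


def _quote(token):
    # Quote the token as a phrase so ':' and '-' are not parsed as query syntax.
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filter_query(user_tokens,
                       allow_field="__ALLOW_TOKEN__document",
                       deny_field="__DENY_TOKEN__document"):
    """Express the same rule as a Lucene-syntax filter string (assumed layout)."""
    if not user_tokens:
        raise ValueError("a user with no tokens cannot be granted any document")
    terms = " OR ".join(_quote(t) for t in sorted(user_tokens))
    return "+{0}:({2}) -{1}:({2})".format(allow_field, deny_field, terms)


peter = {"myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"}
directory_a = {"myAD:S-123-456-76890", "myAD:S-23-64-12345"}
directory_b = {"myAD:S-123-456-76890"}

sees_a = can_see(peter, directory_a)   # Group2's SID is in DirectoryA's ALLOW set
sees_b = can_see(peter, directory_b)   # no token in common with DirectoryB
fq = build_filter_query({"myAD:S-1-1-0"})
```

The trade-off raised above still applies: the rule is cheap to evaluate at query time, but any change to a document's allow/deny tokens requires that document to be re-indexed.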


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>

For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there's two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible
> alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890" and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching
> > his search criteria, PLUS the intersection of his tokens with the
> > document ALLOW tokens, MINUS the intersection of his tokens with the
> > document DENY tokens (there aren't any involved in this example).
> > So only files that have one of his three tokens as an ALLOW
> > attribute would be returned.
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).
> >>  These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.
> >>  You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> >> peter.sturge@googlemail.com>]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >>
> >> To: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> >> peter.sturge@googlemail.com>]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
> >> dev@lucene.apache.org>
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
> >> peter.sturge@googlemail.com>]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org<ma...@lucene.apache.org>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
> >> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
> >> connectors-dev@incubator.apache.org<mailto:
> >> connectors-dev@incubator.apache.org>'; '
> >> connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
> >> dominique.bejean@eolya.fr>]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>;
> >> connectors-dev@incubator.apache.org<mailto:
> >> connectors-dev@incubator.apache.org>
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> Le 20/04/10 13:34,
> >> karl.wright@nokia.com<ma...@nokia.com> a écrit :
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>
> >>  Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> "Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
>
>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

I finally had a moment to review the SOLR-1872 and SOLR-1834 contributions in detail, and have a couple of Solr-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  SOLR-1834 uses MUST, MUST_NOT, and SHOULD filter items, while SOLR-1872 uses standard AND and OR filter query clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?
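
To make the query-modification approach concrete, here is a minimal sketch in Python of how a client might turn a user's access tokens into filter clauses. It is an illustration under stated assumptions, not the code of either contribution: the function names are hypothetical, the "TOKEN:" response format and the two field names are taken from earlier messages in this thread, and the token fields are assumed to be indexed as untokenized strings. It says nothing about how Solr evaluates the filters internally, which is the open question above.

```python
ALLOW_FIELD = "__ALLOW_TOKEN__document"  # field names as quoted in this thread
DENY_FIELD = "__DENY_TOKEN__document"


def parse_tokens(response_text):
    """Collect the values of 'TOKEN:<value>' lines; other lines are ignored."""
    prefix = "TOKEN:"
    return [line.strip()[len(prefix):]
            for line in response_text.splitlines()
            if line.strip().startswith(prefix)]


def quote(value):
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filter_queries(tokens):
    """Return two filter strings: require an ALLOW match, exclude DENY matches."""
    if not tokens:
        # Assumption: a user with no tokens sees nothing, so refuse to build
        # a filter rather than return an empty (unrestricted) one.
        raise ValueError("no access tokens; refuse the search")
    any_token = "(" + " OR ".join(quote(t) for t in tokens) + ")"
    return [ALLOW_FIELD + ":" + any_token, "-" + DENY_FIELD + ":" + any_token]


# Sample response shaped like the authority service output described earlier.
response = "TOKEN:myAD:S-1-1-0\nTOKEN:myAD:S-323-999-12345\nTOKEN:myAD:S-23-64-12345\n"
for fq in build_filter_queries(parse_tokens(response)):
    print("fq=" + fq)
```

Each returned string would be sent as its own fq parameter, since Solr intersects multiple fq parameters with the main query. Keeping the ALLOW and DENY restrictions in separate clauses makes it harder for an OR in one clause to widen the other.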

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
> The reason for this is that it is almost never the case that
> somebody wants to provide multiple credentials in order to be able to see their results.
> Most enterprises who have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you read my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> Sun's kerb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Erik Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). It is not uncommon for user/group access control to change multiple times in one day.
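
As a rough illustration of the alternative, the sketch below keeps access rules outside the index and resolves them to filter queries at search time, so a grant or revocation is a change to the rule store and not a re-index. It is a hypothetical Python stand-in, not the actual acl.xml format or the SOLR-1872 code: the AclStore class, its methods, and the sample field names (source, level) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AclStore:
    """In-memory stand-in for an external ACL file; the layout is hypothetical."""
    allow: dict = field(default_factory=dict)  # principal -> list of sub-queries
    deny: dict = field(default_factory=dict)   # principal -> list of sub-queries

    def grant(self, principal, subquery):
        """Add an allow sub-query for a user or group."""
        self.allow.setdefault(principal, []).append(subquery)

    def revoke(self, principal, subquery):
        """Remove an allow sub-query if present; the index is untouched."""
        if subquery in self.allow.get(principal, []):
            self.allow[principal].remove(subquery)

    def filter_queries(self, principals):
        """Build filter strings for a user's principals, or None if nothing is granted."""
        allows = [q for p in principals for q in self.allow.get(p, [])]
        denies = [q for p in principals for q in self.deny.get(p, [])]
        if not allows:
            return None  # nothing granted: the caller should refuse the search
        fqs = ["(" + ") OR (".join(allows) + ")"]
        fqs += ["-(" + q + ")" for q in denies]
        return fqs


acl = AclStore()
acl.grant("group1", "source:firewall")
acl.grant("peter", "source:proxy")
acl.deny["peter"] = ["level:secret"]
print(acl.filter_queries(["peter", "group1"]))
acl.revoke("peter", "source:proxy")  # takes effect on the next search
print(acl.filter_queries(["peter", "group1"]))
```

The trade-off is that the rules must be expressible as queries over fields already in the index, whereas index-time tokens can carry whatever the source repository reports.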

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones that match
> > his search criteria, have at least one ALLOW token in common with his
> > tokens, and have no DENY token in common with his tokens (there
> > aren't any DENY tokens involved in this example).  So only files
> > that have one of his three tokens as an ALLOW attribute would be
> > returned.
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).  These tokens thus have no real meaning
> >> outside of LCF.  You must regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So, imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.  You really don't want to know what's in them,
> >> since that's the job of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regard to security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates that you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org';
> >> 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and I found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (Livelink,
> >> Documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whatever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> "Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

I finally had a moment to review the SOLR-1872 and SOLR-1834 contributions in detail, and have a couple of Solr-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it also appears that each modifies the user query in a different way.  1834 uses MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR filterquery clauses.  Both of them are constructed using Solr FilterQuery objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the different kinds of clause structure.  Do you know if there is a difference?  For example, is there any possibility that AND/OR clauses can permit documents to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  Or, is it a combination of the two, depending on the details of your actual filter clause?
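
To make question (1) concrete, here is a minimal sketch of how a security restriction could be written in Lucene's classic query syntax, where a leading `+` marks a MUST clause and a leading `-` marks a MUST_NOT clause. This is an illustration only: the function names are hypothetical, the field names are supplied by the caller, and nothing here is taken from the SOLR-1834 or SOLR-1872 code. It also does not settle question (2), i.e. when Solr applies such a filter.

```python
def quote_term(value: str) -> str:
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_security_filter(user_tokens, allow_field: str, deny_field: str) -> str:
    """Build a filter string: at least one allow match, and no deny match.

    Hypothetical helper. An explicit OR is used inside each group so the
    result does not depend on the parser's default operator.
    """
    tokens = list(user_tokens)
    if not tokens:
        # A user with no tokens should see nothing; refuse to build a filter
        # rather than risk producing one that matches everything.
        raise ValueError("user has no access tokens")
    allow = " OR ".join(f"{allow_field}:{quote_term(t)}" for t in tokens)
    deny = " OR ".join(f"{deny_field}:{quote_term(t)}" for t in tokens)
    # '+' = MUST (required), '-' = MUST_NOT (prohibited)
    return f"+({allow}) -({deny})"


if __name__ == "__main__":
    print(build_security_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"],
                                "allow_field", "deny_field"))
```

The point of the sketch is that the deny group is a prohibited clause: a document carrying any of the user's tokens in the deny field is excluded even when it also matches the allow group.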

I also haven't heard much from you in the last week or so - have you thought further about what you intend to do, and can you let me know whether you are still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: dev@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-user@incubator.apache.org; lucene-dev@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.
> In fact, LCF has nothing to do with authentication at all - just authorization.
>  The reason for this is that it is almost never the case that
> somebody wants to provide multiple credentials in order to be able to see their results.
>  Most enterprises who have multiple repositories authenticate against
> AD and then map AD user names to repository user names in order to
> access those repositories.  If you noted my earlier posts from this
> morning, you may have noted that I'm looking at recommending JAAS plus
> Sun's Krb5 login module for handling the "authenticate against AD"
> case, which would cover some 95%+ of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you - this is best handled 'upstream'. In fact, I use a JAAS plugin in other places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know
> enough about solr to know what kinds of issues this might entail, but
> Lucene certainly has a model of metadata that's pretty flexible, so I
> don't think this would be difficult at all.  Erik Hatcher also seemed
> to confirm my suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some [group of] documents ? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of AD, as they can be changed (this doesn't happen often, but it can happen after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is huge, this is non-trivial (time-wise). It is not uncommon for user/group access control to change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not sure how that would work across millions of [arbitrarily grouped ] documents (I'm not familiar enough with payloads to know if this would be a good or bad idea).


>
> This is exactly why I think that we need to do the authentication
> upstream of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could
> just use that feature.  But you know more about it than me, at this
> point.  It would be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from
> NTFS and apply security after-the-fact with some acl specification
> file.  In that case, I'd write a repository connector that was based
> on the file system connector (already part of the stable of connectors
> for LCF) which reads ACL information from your acl.xml file.  Or, if
> you prefer a UI for specifying ACL information, you could extend the
> connector so that security is configured in the UI without having an
> external acl.xml file at all - which would be a nice addition to the
> existing file system connector.  (Repository connections and jobs are
> configured internally in LCF by XML documents stored in the database,
> so they can be arbitrarily structured.  I'm happy to help you figure
> out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated from the network and stored. After the fact, there is no persistent location of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this
> all work for you, so perhaps you can describe how you expect a user to
> interact with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that
> need to be addressed:
>  1. Authentication - make sure the incoming query is from a valid
> user, and the passed-in credentials (hash, certificate etc.) are
> correct
>  2. Query filtering - potentially reduce the number/type of
> returned results based on the allow/deny metadata for the
> authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any
> container (Tomcat/Jetty et al) authentication that might be configured
> (probably related to your previous post). In many ways, it could/should
> be that the Authority (AD) part of the connector should only be
> concerned with 1. and not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that
> 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task
> to the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs
> into Solr at document index time? This would be a problem for a whole
> load of reasons, but maybe I'm missing something here? (see below for
> a possible alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need
> to be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems
> 2. 'Hard-coding' acl information in the index makes it
> non-portable, resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which
> means the LCF Authority connector is mainly for Authentication (see
> above), which is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if
> you're not using a certificate, you have to start with a password, but
> the connector only has access to the password hash (unless the pwd is
> sent in the query url). I don't know of a way to confirm identities in
> AD using only the username and hash (AD does the hash compare). I
> believe this is where container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a
> transient network stream), we have an LCF 'Solr Field Filter Query'
> connector that decides which Filter Queries to apply (allow and deny)
> for the passed in SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would
> mean extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector
> for this as well (and any other document types that might be read in).
> This keeps the index 'clean' of acl-specific metadata, and allows for
> in-place changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know
> if I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other
> half of 1. will be implementing the LCF interface in the
> SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to
> get a good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide
> > the list of tokens that belong to an authenticated user.  Thus,
> > there's no access/deny for an authority; that's attached to the
> > document (as it is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it,
> > call it DirectoryA, and the directory allows read access to the
> > following SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just
> > the SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and
> > this connection is set up to talk to the governing AD domain
> > controller for this Windows file system.  We now know enough to
> > describe the document indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr:
> > "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs
> > are (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones that match
> > his search criteria, AND have at least one ALLOW token in common with
> > his tokens, AND have no DENY token in common with his tokens (there
> > aren't any DENY tokens involved in this example).  So only files that
> > have one of his three tokens as an ALLOW attribute would be returned.
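
The DirectoryA/DirectoryB example can be checked with a few lines of set logic. This is only a sketch of the rule as described in the message: the helper names `qualify` and `is_visible` are hypothetical, and a real deployment would apply the rule inside the search engine rather than in client code.

```python
def qualify(authority: str, sids):
    """Prefix each SID with the authority connection name, e.g. 'myAD:S-1-1-0'."""
    return {f"{authority}:{sid}" for sid in sids}


def is_visible(user_tokens, allow_tokens, deny_tokens=frozenset()):
    """A document is visible when the user shares at least one ALLOW token
    with it and shares no DENY token with it."""
    user = set(user_tokens)
    return bool(user & set(allow_tokens)) and not (user & set(deny_tokens))


# ALLOW tokens attached to the documents at index time (from the example).
directory_a = qualify("myAD", ["S-123-456-76890", "S-23-64-12345"])
directory_b = qualify("myAD", ["S-123-456-76890"])

# Tokens the authority service returns for Peter (member of Group2).
peter = qualify("myAD", ["S-1-1-0", "S-323-999-12345", "S-23-64-12345"])

if __name__ == "__main__":
    print(is_visible(peter, directory_a))  # True: Group2 is allowed
    print(is_visible(peter, directory_b))  # False: only Group1 is allowed
```

A DENY token takes precedence: if a DirectoryA document also carried "myAD:S-323-999-12345" as a DENY token, `is_visible` would return False for Peter even though his Group2 token matches.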
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole
> > new security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not
> > specifically about a file system, although that is a convenient
> > example.  If you are building your own kind of repository with your
> > own security setup, that's fine - but in the LCF world you'd need to
> > create an authority connector for your repository (which maybe reads
> > your acl.xml file), as well as a repository connector (which hands
> > documents to LCF and provides it with the access tokens that make
> > security work).  Of course, you can do something much lighter that
> > doesn't include LCF at all if you are just integrating a custom
> > repository of your own, but it sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way
> > that is consistent from authority connector to repository connector
> > for a given repository kind.  Anybody can write a connector, so the
> > beauty of all this is that you can build a system where data from
> > many disparate sources is indexed, and security for each is
> > simultaneously enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed
> > data with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or
> > accessible file data).
> > In fact, in my particular case, almost none of the indexed data
> > comes from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or
> > arbitrary data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority
> >> connection and the LCF repository connection that picks up the
> >> documents (from wherever).  These tokens thus have no real meaning
> >> outside of LCF.  You must regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get
> >> back a set that is CONSISTENT with the tokens that were attached to
> >> the documents LCF sent to Solr for indexing in the first place.
> >> So, you don't have to worry about it, and that's kind of the idea.
> >> So you can imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the
> >> desired user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of
> >> the user's SIDs.  For other authorities, the access tokens are
> >> wildly different.  You really don't want to know what's in them,
> >> since that's the job of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is
> >> also the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with
> >> regards security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been
> >> stored for a particular user in the underlying acl store (e.g.
> >> Active Directory)?
> >> How does AD and/or LCF handle storing such data in its schema?
> >> (does AD need its schema extended?) Presumably, any such AD fields
> >> would need to be queried for effective rights in order to cater for
> >> group membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I
> >> can look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along
> >> with the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take
> >> in as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just
> >> a matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info
> >> from its parent container (Tomcat, Jetty etc.), or have I
> >> misinterpreted this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all
> >> is that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr
> >> with each document - one set of Allow tokens, and a second set of
> >> Deny tokens.  The LCF solr output connection doesn't yet do this,
> >> but it is trivial for me to make that happen.
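
As a rough sketch of the client side of this exchange, the snippet below builds the UserACLs URL and extracts tokens from a response shaped like the "TOKEN:..." lines shown above. The function names are hypothetical, and the response format is assumed only from the example in this message ("something like"); a real authority service may return additional line types, which this sketch simply ignores.

```python
from urllib.parse import urlencode


def authority_url(base_url: str, username: str) -> str:
    """Build the UserACLs request URL, URL-encoding the username."""
    return f"{base_url}/UserACLs?{urlencode({'username': username})}"


def parse_tokens(body: str):
    """Return the token values from lines of the form 'TOKEN:<value>'.

    Assumption: one token per line, prefixed with 'TOKEN:'. Any other
    line is skipped.
    """
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


if __name__ == "__main__":
    url = authority_url("http://localhost:8080/lcf-authority-service",
                        "someusername@somedomain.com")
    print(url)
    # The body would normally come from an HTTP GET of `url`; a literal
    # stands in here so the sketch runs without a live LCF instance.
    sample_body = "TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"
    print(parse_tokens(sample_body))
```

The returned tokens are opaque strings; the caller would compare them against the allow and deny metadata indexed with each document, without interpreting their contents.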
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one
> >> of the things I was planning for SOLR-1872 is to make acl.xml (or
> >> rather its behaviour) 'pluggable' - i.e. it would just be one of a
> >> series of plugins that could be used for obtaining back-end
> >> authentication information.
> >>
> >> If you're good with LCF, perhaps we could work together to build
> >> this in.
> >> One of the first things would be defining an interface that would
> >> be as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the
> >> search query perspective, although instead of the acl xml file you
> >> specify, LCF stipulates that you would dynamically query the
> >> lcf-authority-service servlet for the access tokens themselves.
> >> That would get you support for AD, Documentum, LiveLink, Meridio,
> >> and Memex for free. It seems likely that this component could be
> >> modified to work with LCF with minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org';
> >> 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to
> >> operate against disparate repositories, and use the appropriate
> >> authority to secure any given document.  The solr people are aware
> >> of this design, which addresses the issues raised by SOLR-1834 very
> >> nicely.  However, as I said before, time is a problem, and the work
> >> still needs to be done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and found these:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into
> >> whatever client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces
> >> more completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> " Once all these documents and their access tokens are handed to
> >> the search engine, it is the search engine's job to enforce
> >> security by excluding inappropriate documents from the search
> >> results. For Lucene, this infrastructure is expected to be built on
> >> top of Lucene's generic metadata abilities, but has not been
> >> implemented at this time."
> >>
> >> I am not sure I understand. Does this mean that, for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> -------------------------------------------------------------------
> >> -- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
> >> additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
>
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.  In
> fact, LCF has nothing to do with authentication at all - just authorization.
> The reason for this is that it is almost never the case that somebody
> wants to provide multiple credentials in order to be able to see their
> results.  Most enterprises that have multiple repositories authenticate
> against AD and then map AD user names to repository user names in order to
> access those repositories.  If you saw my earlier posts from this morning,
> you may have noticed that I'm looking at recommending JAAS plus Sun's Krb5
> login module for handling the "authenticate against AD" case, which would
> cover some 95%+ of the real-world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you -
this is best handled 'upstream'. In fact, I use a JAAS plugin in other
places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in Solr at index time.  I don't know enough
> about Solr to know what kinds of issues this might entail, but Lucene
> certainly has a model of metadata that's pretty flexible, so I don't think
> this would be difficult at all.  Erik Hatcher also seemed to confirm my
> suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more
that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with
documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some
[group of] documents? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of
AD, as they can be changed (this doesn't happen often, but it can happen
after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is
huge, this is non-trivial (time-wise). It is not uncommon for user/group
access control to change multiple times in one day.

There might be a way of storing ACL data in a payload or similar, but I'm
not sure how that would work across millions of [arbitrarily grouped]
documents (I'm not familiar enough with payloads to know if this would be a
good or bad idea).


>
> This is exactly why I think that we need to do the authentication upstream
> of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could just use
> that feature.  But you know more about it than me, at this point.  It would
> be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from NTFS
> and apply security after-the-fact with some acl specification file.  In that
> case, I'd write a repository connector that was based on the file system
> connector (already part of the stable of connectors for LCF) which reads ACL
> information from your acl.xml file.  Or, if you prefer a UI for specifying
> ACL information, you could extend the connector so that security is
> configured in the UI without having an external acl.xml file at all - which
> would be a nice addition to the existing file system connector.  (Repository
> connections and jobs are configured internally in LCF by XML documents
> stored in the database, so they can be arbitrarily structured.  I'm happy to
> help you figure out how to do this if this is what you decide to do.)
>

For my particular requirements, there are no files - the data is generated
from the network and stored. After the fact, there is no persistent location
of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. It could be
worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this all
> work for you, so perhaps you can describe how you expect a user to interact
> with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that need
> to be addressed:
>  1. Authentication - make sure the incoming query is from a valid user, and
> the passed-in credentials (hash, certificate etc.) are correct
>  2. Query filtering - potentially reduce the number/type of returned
> results based on the allow/deny metadata for the authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any container
> (Tomcat/Jetty et al) authentication that might be configured (probably
> related to your previous post). In many ways, it could/should be that the
> Authority (AD) part of the connector should only be concerned with 1. and
> not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that 'closes the
> loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task to
> the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs into
> Solr at document index time? This would be a problem for a whole load of
> reasons, but maybe I'm missing something here? (see below for a possible
> alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need to
> be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems.
> 2. 'Hard-coding' acl information in the index makes it non-portable,
> resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which means
> the LCF Authority connector is mainly for Authentication (see above), which
> is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if you're
> not using a certificate, you have to start with a password, but the
> connector only has access to the password hash (unless the pwd is sent in
> the query url). I don't know of a way to confirm identities in AD using only
> the username and hash (AD does the hash compare). I believe this is where
> container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a transient
> network stream), we have an LCF 'Solr Field Filter Query' connector that
> decides which Filter Queries to apply (allow and deny) for the passed in
> SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would mean
> extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector for
> this as well (and any other document types that might be read in). This
> keeps the index 'clean' of acl-specific metadata, and allows for in-place
> changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know if
> I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other half
> of 1. will be implementing the LCF interface in the SolrACLSecurity class,
> to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to get a
> good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide the
> > list of tokens that belong to an authenticated user.  Thus, there's no
> > access/deny for an authority; that's attached to the document (as it
> > is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it, call
> > it DirectoryA, and the directory allows read access to the following
> > SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just the
> > SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and this
> > connection is set up to talk to the governing AD domain controller for
> > this Windows file system.  We now know enough to describe the document
> > indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> > and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs are
> > (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching his
> > search criteria AND having at least one ALLOW token in common with his
> > tokens, MINUS any that have a DENY token in common with his tokens
> > (there aren't any DENY tokens involved in this example).  So only
> > files that have one of his three tokens as an ALLOW attribute would be
> > returned.
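
The allow/deny rule described in this paragraph can be sketched in a few lines of Python. This is only an illustration of the logic as stated in the message, not code from LCF or Solr; the function name is_visible is hypothetical, and the token values are the ones used in the example.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if a document should be shown to the user.

    A document is visible when the user holds at least one of its ALLOW
    tokens and none of its DENY tokens. (Hypothetical helper, for
    illustration only.)
    """
    user = set(user_tokens)
    has_allow = bool(user & set(allow_tokens))
    has_deny = bool(user & set(deny_tokens))
    return has_allow and not has_deny


# Peter's tokens, as returned by the authority service in the example.
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]

# ALLOW tokens attached to files in each directory in the example.
directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
directory_b = ["myAD:S-123-456-76890"]

print(is_visible(peter, directory_a))  # True: Peter is in Group2
print(is_visible(peter, directory_b))  # False: only Group1 is allowed
```

A DENY match always wins in this sketch, so a document carrying one of Peter's tokens as a DENY attribute would be hidden even if it also carried a matching ALLOW token.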
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole new
> > security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed data
> > with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or accessible
> > file data).
> > In fact, in my particular case, almost none of the indexed data comes
> > from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not specifically
> > about a file system, although that is a convenient example.  If you
> > are building your own kind of repository with your own security setup,
> > that's fine - but in the LCF world you'd need to create an authority
> > connector for your repository (which maybe reads your acl.xml file),
> > as well as a repository connector (which hands documents to LCF and
> > provides it with the access tokens that make security work).  Of
> > course, you can do something much lighter that doesn't include LCF at
> > all if you are just integrating a custom repository of your own, but it
> > sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way that
> > is consistent from authority connector to repository connector for a
> > given repository kind.  Anybody can write a connector, so the beauty
> > of all this is that you can build a system where data from many
> > disparate sources is indexed, and security for each is simultaneously
> > enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed data
> > with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or accessible
> > file data).
> > In fact, in my particular case, almost none of the indexed data comes
> > from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or arbitrary
> > data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored
> >> for a particular user in the underlying acl store (e.g. Active
> >> Directory)?
> >> How does AD and/or LCF handle storing such data in its schema? (does
> >> AD need its schema extended?) Presumably, any such AD fields would
> >> need to be queried for effective rights in order to cater for group
> >> membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority connection
> >> and the LCF repository connection that picks up the documents (from
> >> wherever).  These tokens thus have no real meaning outside of LCF.
> >> You must regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get back
> >> a set that is CONSISTENT with the tokens that were attached to the
> >> documents LCF sent to Solr for indexing in the first place.  So, you
> >> don't have to worry about it, and that's kind of the idea.  So you can
> >> imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the desired
> >> user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
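
Step (3), in its "modify the query" form, might look like the following Python sketch. It assumes the tokens are indexed in fields named __ACCESS_TOKEN__document and __DENY_TOKEN__document (the names given for the Solr output connector elsewhere in this thread) and that those fields are untokenized string fields. The helper names are hypothetical and nothing here is taken from LCF or SOLR-1872.

```python
ALLOW_FIELD = "__ACCESS_TOKEN__document"  # field names as quoted in this thread
DENY_FIELD = "__DENY_TOKEN__document"


def quote_term(token):
    """Quote a token as a Lucene phrase, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_security_filter(user_tokens):
    """Build a filter query string restricting results to visible documents.

    Hypothetical helper: requires at least one matching ALLOW token and
    excludes any document carrying a matching DENY token.
    """
    if not user_tokens:
        # No tokens means no access; the caller should return no results.
        raise ValueError("user has no access tokens")
    terms = " OR ".join(quote_term(t) for t in user_tokens)
    return "+{0}:({1}) -{2}:({1})".format(ALLOW_FIELD, terms, DENY_FIELD)


fq = build_security_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"])
print(fq)
```

The resulting string would be sent as a Solr fq (filter query) parameter alongside the user's own query, so the restriction is applied inside the search rather than by post-filtering the result list.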
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of the
> >> user's SIDs.  For other authorities, the access tokens are wildly
> >> different.  You really don't want to know what's in them, since that's
> >> the job of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is also
> >> the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with regards
> >> security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored
> >> for a particular user in the underlying acl store (e.g. Active
> >> Directory)?
> >> How does AD and/or LCF handle storing such data in its schema? (does
> >> AD need its schema extended?) Presumably, any such AD fields would
> >> need to be queried for effective rights in order to cater for group
> >> membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I can
> >> look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along with
> >> the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
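
As an illustration of what "multiple values for each of these two fields" could look like on the wire, here is a Python sketch that builds a Solr XML update message with repeated field elements. The thread does not say which request format the LCF output connector actually uses, so the XML form, the "id" field, and the document id are assumptions made for the example; only the two ACL field names come from the message.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc_id, allow_tokens, deny_tokens):
    """Build a Solr XML <add> message with multi-valued ACL fields."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" is an assumed unique-key field name, not taken from the thread.
    ET.SubElement(doc, "field", name="id").text = doc_id
    # A multi-valued field is expressed by repeating the field element.
    for token in allow_tokens:
        ET.SubElement(doc, "field", name="__ACCESS_TOKEN__document").text = token
    for token in deny_tokens:
        ET.SubElement(doc, "field", name="__DENY_TOKEN__document").text = token
    return ET.tostring(add, encoding="unicode")


message = build_add_message(
    "file:/DirectoryA/report.txt",  # hypothetical document id
    ["myAD:S-123-456-76890", "myAD:S-23-64-12345"],
    [],
)
print(message)
```

For a schema to accept such a document, both ACL fields would need to be declared as multi-valued.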
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take in
> >> as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just a
> >> matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info from
> >> its parent container (Tomcat, Jetty etc.), or have I misinterpreted
> >> this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all is
> >> that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr with
> >> each document - one set of Allow tokens, and a second set of Deny
> >> tokens.  The LCF solr output connection doesn't yet do this, but it
> >> is trivial for me to make that happen.
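
A client of the authority service described here would need to build the lookup URL and read the TOKEN: lines out of the response. The Python sketch below shows one way to do both, under the assumption that the response is plain text with one TOKEN:value entry per line, as in the sample above. The function names are hypothetical, and lines without the TOKEN: prefix are simply skipped because the message does not say what else the service may return.

```python
from urllib.parse import urlencode

# Base URL as given in the message, for a locally installed instance.
AUTHORITY_URL = "http://localhost:8080/lcf-authority-service/UserACLs"


def authority_request_url(username, base_url=AUTHORITY_URL):
    """Build the lookup URL, percent-encoding the user name."""
    return base_url + "?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract token values from a response body of TOKEN:<value> lines.

    Lines without the TOKEN: prefix are ignored, since the thread does not
    say what else the service may return.
    """
    prefix = "TOKEN:"
    return [
        line.strip()[len(prefix):]
        for line in body.splitlines()
        if line.strip().startswith(prefix)
    ]


print(authority_request_url("someusername@somedomain.com"))
print(parse_tokens("TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"))
```

The parsed list is what a search component would then compare against the Allow and Deny token metadata stored with each document.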
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one of
> >> the things I was planning for SOLR-1872 is to make acl.xml (or rather
> >> its behaviour) 'pluggable' - i.e. it would just be one of a series of
> >> plugins that could be used for obtaining back-end authentication
> >> information.
> >>
> >> If you're good with LCF, perhaps we could work together to build this
> >> in.
> >> One of the first things would be defining an interface that would be
> >> as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the search
> >> query perspective, although instead of the acl xml file you specify,
> >> LCF stipulates you would dynamically query the lcf-authority-service
> >> servlet for the access tokens themselves.  That would get you support
> >> for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems
> >> likely that this component could be modified to work with LCF with
> >> minor effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org';
> >> 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to operate
> >> against disparate repositories, and use the appropriate authority to
> >> secure any given document.  The solr people are aware of this design,
> >> which addresses the issues raised by SOLR-1834 very nicely.  However,
> >> as I said before, time is a problem, and the work still needs to be
> >> done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and I found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with
> >> items coming from various sources at the same time (livelink,
> >> documentum, file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into whatever
> >> client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whatever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces more
> >> completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you might
> >> provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query time
> >> in order to remove from the result list the items the user is not
> >> allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> " Once all these documents and their access tokens are handed to the
> >> search engine, it is the search engine's job to enforce security by
> >> excluding inappropriate documents from the search results. For
> >> Lucene, this infrastructure is expected to be built on top of
> >> Lucene's generic metadata abilities, but has not been implemented at
> >> this time."
> >>
> >> I am not sure I understand. Does this mean that, for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
>
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.  In
> fact, LCF has nothing to do with authentication at all - just authorization.
>  The reason for this is that it is almost never the case that somebody
> wants to provide multiple credentials in order to be able to see their results.
>  Most enterprises who have multiple repositories authenticate against AD and
> then map AD user names to repository user names in order to access those
> repositories.  If you noted my earlier posts from this morning, you may have
> noted that I'm looking at recommending JAAS plus Sun's Krb5 login module
> for handling the "authenticate against AD" case, which would cover some 95%+
> of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you -
this is best handled 'upstream'. In fact, I use a JAAS plugin in other
places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know enough
> about solr to know what kinds of issues this might entail, but Lucene
> certainly has a model of metadata that's pretty flexible, so I don't think
> this would be difficult at all.  Erik Hatcher also seemed to confirm my
> suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more
that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with
documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some [group
of] documents ? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of
AD, as they can be changed (this doesn't happen often, but it can happen
after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is
huge, this is non-trivial (time-wise). There are not uncommon scenarios
where user/group access control can change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm
not sure how that would work across millions of [arbitrarily grouped]
documents (I'm not familiar enough with payloads to know if this would be a
good or bad idea).


>
> This is exactly why I think that we need to do the authentication upstream
> of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could just use
> that feature.  But you know more about it than me, at this point.  It would
> be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from NTFS
> and apply security after-the-fact with some acl specification file.  In that
> case, I'd write a repository connector that was based on the file system
> connector (already part of the stable of connectors for LCF) which reads ACL
> information from your acl.xml file.  Or, if you prefer a UI for specifying
> ACL information, you could extend the connector so that security is
> configured in the UI without having an external acl.xml file at all - which
> would be a nice addition to the existing file system connector.  (Repository
> connections and jobs are configured internally in LCF by XML documents
> stored in the database, so they can be arbitrarily structured.  I'm happy to
> help you figure out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated
from the network and stored. After the fact, there is no persistent location
of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be
worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this all
> work for you, so perhaps you can describe how you expect a user to interact
> with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there are two layers of security that need
> to be addressed:
>  1. Authentication - make sure the incoming query is from a valid user, and
> the passed-in credentials (hash, certificate etc.) are correct
>  2. Query filtering - potentially reduce the number/type of returned
> results based on the allow/deny metadata for the authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any container
> (Tomcat/Jetty et al) authentication that might be configured (probably
> related to your previous post). In many ways, it could/should be that the
> Authority (AD) part of the connector should only be concerned with 1. and
> not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that 'closes the
> loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task to
> the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs into
> Solr at document index time? This would be a problem for a whole load of
> reasons, but maybe I'm missing something here? (see below for a possible
> alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need to
> be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems.
> 2. 'Hard-coding' acl information in the index makes it non-portable,
> resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which means
> the LCF Authority connector is mainly for Authentication (see above), which
> is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if you're
> not using a certificate, you have to start with a password, but the
> connector only has access to the password hash (unless the pwd is sent in
> the query url). I don't know of a way to confirm identities in AD using only
> the username and hash (AD does the hash compare). I believe this is where
> container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a transient
> network stream), we have an LCF 'Solr Field Filter Query' connector that
> decides which Filter Queries to apply (allow and deny) for the passed in
> SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would mean
> extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector for
> this as well (and any other document types that might be read in). This
> keeps the index 'clean' of acl-specific metadata, and allows for in-place
> changes and easy cross-document/index/instance access control.
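
To make the 'Solr Field Filter Query' idea above a bit more concrete, here is a
minimal sketch of how such a lookup might behave. Everything in it is an
assumption made for illustration: the ACL table layout, the field names
(source, level) and the function name are hypothetical and are not taken from
SOLR-1872 or LCF. The point is only that the allow/deny rules live outside the
index and are turned into filter queries per request.

```python
# Hypothetical external ACL table: SID/token -> allow and deny sub-queries.
# In the design discussed above this data would live outside the index
# (e.g. in something like acl.xml), so it can change without re-indexing.
ACL = {
    "myAD:S-23-64-12345": {
        "allow": ["source:network"],
        "deny": ["level:secret"],
    },
}


def filter_queries(sids, acl):
    """Return a list of filter-query strings for the given SIDs."""
    allows, denies = [], []
    for sid in sids:
        entry = acl.get(sid, {})
        allows.extend(entry.get("allow", []))
        denies.extend(entry.get("deny", []))

    fqs = []
    if allows:
        # Any one of the allow sub-queries is enough to admit a document.
        fqs.append(" OR ".join("(" + q + ")" for q in allows))
    else:
        # No allow rule at all: a filter intended to match nothing.
        fqs.append("-*:*")
    # Every deny sub-query excludes documents on its own.
    fqs.extend("-(" + q + ")" for q in denies)
    return fqs


print(filter_queries(["myAD:S-1-1-0", "myAD:S-23-64-12345"], ACL))
```

Because the table is consulted at query time, revoking a group's access would
only mean editing the table, which is the property argued for above.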
>
>
> If the above interpretation is [roughly] correct (please let me know if
> I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other half
> of 1. will be implementing the LCF interface in the SolrACLSecurity class,
> to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to get a
> good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> > <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide the
> > list of tokens that belong to an authenticated user.  Thus, there's no
> > access/deny for an authority; that's attached to the document (as it
> > is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it, call
> > it DirectoryA, and the directory allows read access to the following
> > SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just the
> > SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and this
> > connection is set up to talk to the governing AD domain controller for
> > this Windows file system.  We now know enough to describe the document
> > indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890",
> > and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs are
> > (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones that match his
> > search criteria AND have at least one ALLOW token in common with his
> > tokens, EXCLUDING any that have a DENY token in common with his tokens
> > (there aren't any DENY tokens involved in this example).  So only
> > files that have one of his three tokens as an ALLOW attribute would be
> > returned.
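
The matching rule just described can be sketched with plain set operations. This
is only an illustration using the example tokens from this message; the function
name is made up, and a real deployment would express the same rule as a query
restriction inside the search engine rather than in client code.

```python
def visible(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if a document should be shown to the user.

    The document is visible when the user holds at least one of its
    ALLOW tokens and none of its DENY tokens.
    """
    user = set(user_tokens)
    has_allow = bool(user & set(allow_tokens))
    has_deny = bool(user & set(deny_tokens))
    return has_allow and not has_deny


# Tokens returned by the authority service for Peter in the example.
peter = {"myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"}

# ALLOW tokens attached to the documents in each directory.
directory_a = {"myAD:S-123-456-76890", "myAD:S-23-64-12345"}
directory_b = {"myAD:S-123-456-76890"}

print(visible(peter, directory_a))  # True: Peter holds the Group2 token
print(visible(peter, directory_b))  # False: only Group1 may read DirectoryB
```

So in this example Peter would see the DirectoryA files but not the DirectoryB
files.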
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole new
> > security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed data
> > with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or accessible
> > file data).
> > In fact, in my particular case, almost none of the indexed data comes
> > from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not specifically
> > about a file system, although that is a convenient example.  If you
> > are building your own kind of repository with your own security setup,
> > that's fine - but in the LCF world you'd need to create an authority
> > connector for your repository (which maybe reads your acl.xml file),
> > as well as a repository connector (which hands documents to LCF and
> > provides it with the access tokens that make security work).  Of
> > course, you can do something much lighter that doesn't include LCF at all
> > if you are just integrating a custom repository of your own, but it
> > sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way that
> > is consistent from authority connector to repository connector for a
> > given repository kind.  Anybody can write a connector, so the beauty
> > of all this is that you can build a system where data from many
> > disparate sources is indexed, and security for each is simultaneously
> > enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed data
> > with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or accessible
> > file data).
> > In fact, in my particular case, almost none of the indexed data comes
> > from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or arbitrary
> > data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored
> >> for a particular user in the underlying acl store (e.g. Active
> >> Directory)?
> >> How does AD and/or LCF handle storing such data in its schema? (Does
> >> AD need its schema extended?) Presumably, any such AD fields would
> >> need to be queried for effective rights in order to cater for group
> >> membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority connection
> >> and the LCF repository connection that picks up the documents (from
> >> wherever).  These tokens thus have no real meaning outside of LCF.  You must
> >> regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get back
> >> a set that is CONSISTENT with the tokens that were attached to the
> >> documents LCF sent to Solr for indexing in the first place.  So, you
> >> don't have to worry about it, and that's kind of the idea.  So you
> >> can imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the desired
> >> user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
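
As a rough sketch of step (3) in its "modify the query" form: given the user's
tokens, a client could build one extra filter clause and send it along with the
search. The two field names are the ones mentioned elsewhere in this thread for
the LCF Solr output connector; the function names, the quoting helper and the
exact clause shape are assumptions made for illustration and would need checking
against a real Solr schema.

```python
def quote_token(token):
    """Quote a token so characters such as ':' and '-' are taken literally."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def security_filter(tokens,
                    allow_field="__ACCESS_TOKEN__document",
                    deny_field="__DENY_TOKEN__document"):
    """Build a filter clause from the user's access tokens.

    Required: at least one of the user's tokens in the allow field.
    Prohibited: any of the user's tokens in the deny field.
    """
    if not tokens:
        # Assumption: a user with no tokens should see nothing, so return
        # a filter intended to match no documents.
        return "-*:*"
    clause = " OR ".join(quote_token(t) for t in tokens)
    return "+{}:({}) -{}:({})".format(allow_field, clause, deny_field, clause)


print(security_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"]))
```

The resulting string would typically be passed as a filter alongside the user's
own query, so the restriction is applied by the search engine instead of by
post-filtering the results.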
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of the
> >> user's SIDs.  For other authorities, the access tokens are wildly
> >> different.  You really don't want to know what's in them, since that's the job
> >> of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is also
> >> the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with regards
> >> security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored
> >> for a particular user in the underlying acl store (e.g. Active
> >> Directory)?
> >> How does AD and/or LCF handle storing such data in its schema? (Does
> >> AD need its schema extended?) Presumably, any such AD fields would
> >> need to be queried for effective rights in order to cater for group
> >> membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I can
> >> look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along with
> >> the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take in
> >> as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just a
> >> matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info from
> >> its parent container (Tomcat, Jetty etc.), or have I misinterpreted
> >> this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all is
> >> that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr with
> >> each document - one set of Allow tokens, and a second set of Deny
> >> tokens.  The LCF solr output connection doesn't yet do this, but it
> >> is trivial for me to make that happen.
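
For illustration, a caller might build that URL and read the tokens out of the
response roughly as follows. This is a sketch under assumptions: the base URL is
the local example given above, the only response lines considered are those
starting with "TOKEN:", and the function names are invented here. No HTTP call
is made; the sample body simply mirrors the example response.

```python
from urllib.parse import urlencode

# Example base URL from the message above (a locally installed instance).
BASE_URL = "http://localhost:8080/lcf-authority-service/UserACLs"


def authority_url(username, base_url=BASE_URL):
    """Build the authority-service URL for a user, URL-encoding the name."""
    return base_url + "?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract token values from a response made of 'TOKEN:...' lines.

    Lines that do not start with 'TOKEN:' are ignored; how a real service
    reports other conditions is not covered by this sketch.
    """
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


sample_body = "TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"
print(authority_url("someusername@somedomain.com"))
print(parse_tokens(sample_body))
```

The resulting token list is what would then be compared against the Allow and
Deny metadata stored with each document.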
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one of
> >> the things I was planning for SOLR-1872 is to make acl.xml (or rather
> >> its behaviour) 'pluggable' - i.e. it would just be one of a series of
> >> plugins that could be used for obtaining back-end authentication
> >> information.
> >>
> >> If you're good with LCF, perhaps we could work together to build this
> >> in.
> >> One of the first things would be defining an interface that would be
> >> as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the search
> >> query perspective, although instead of the acl.xml file you specify,
> >> LCF stipulates that you would dynamically query the lcf-authority-service
> >> servlet for the access tokens themselves.  That would get you support
> >> for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems
> >> likely that this component could be modified to work with LCF with minor
> >> effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr'
> >> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org';
> >> 'connectors-user@incubator.apache.org'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to operate
> >> against disparate repositories, and use the appropriate authority to
> >> secure any given document.  The solr people are aware of this design,
> >> which addresses the issues raised by SOLR-1834 very nicely.  However,
> >> as I said before, time is a problem, and the work still needs to be
> >> done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> >> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I did some research today and found this:
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> >> http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> The Solr security model has to be able to filter a result list with items
> >> coming from various sources at the same time (livelink, documentum,
> >> file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into whatever
> >> client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> >> authenticator ought to be.  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces more
> >> completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you might
> >> provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org
> >> Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query time
> >> in order to remove from the result list the items the user is not
> >> allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> "Once all these documents and their access tokens are handed to the
> >> search engine, it is the search engine's job to enforce security by
> >> excluding inappropriate documents from the search results. For
> >> Lucene, this infrastructure is expected to be built on top of
> >> Lucene's generic metadata abilities, but has not been implemented at
> >> this time."
> >>
> >> I am not sure I understand. Does this mean that, for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> The authority connectors don't perform authentication at this time.  In
> fact, LCF has nothing to do with authentication at all - just authorization.
>  The reason for this is that it is almost never the case that somebody
> wants to provide multiple credentials in order to be able to see their results.
>  Most enterprises who have multiple repositories authenticate against AD and
> then map AD user names to repository user names in order to access those
> repositories.  If you noted my earlier posts from this morning, you may have
> noted that I'm looking at recommending JAAS plus sun's kerb5 login module
> for handling the "authenticate against AD" case, which would cover some 95%+
> of the real world authentication needed out there.
>
>
I did read your earlier post regarding this, and I totally agree with you -
this is best handled 'upstream'. In fact, I use a JAAS plugin in other
places in the product (not Solr) for authentication.


>
> Yes, the idea is to store SIDs in solr at index time.  I don't know enough
> about solr to know what kinds of issues this might entail, but Lucene
> certainly has a model of metadata that's pretty flexible, so I don't think
> this would be difficult at all.  Erik Hatcher also seemed to confirm my
> suspicions that this would not be a problem.
>

It's certainly not a problem to store this data in Solr. The problem is more
that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with
documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some [group
of] documents? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of
AD, as they can be changed (this doesn't happen often, but it can happen
after an AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is
huge, this is non-trivial (time-wise). It is not uncommon for user/group
access control to change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm
not sure how that would work across millions of [arbitrarily grouped]
documents (I'm not familiar enough with payloads to know if this would be a
good or bad idea).


>
> This is exactly why I think that we need to do the authentication upstream
> of the authority world.
>
>
Agreed.



>
> If Solr handles arbitrary document metadata, then I think we could just use
> that feature.  But you know more about it than me, at this point.  It would
> be great to get an overview of potential ways of doing this.
>
>
Payloads, maybe?


>
> For your particular task, it sounds like you are trying to read from NTFS
> and apply security after-the-fact with some acl specification file.  In that
> case, I'd write a repository connector that was based on the file system
> connector (already part of the stable of connectors for LCF) which reads ACL
> information from your acl.xml file.  Or, if you prefer a UI for specifying
> ACL information, you could extend the connector so that security is
> configured in the UI without having an external acl.xml file at all - which
> would be a nice addition to the existing file system connector.  (Repository
> connections and jobs are configured internally in LCF by XML documents
> stored in the database, so they can be arbitrarily structured.  I'm happy to
> help you figure out how to do this if this is what you decide to do.)
>
For my particular requirements, there are no files - the data is generated
from the network and stored. After the fact, there is no persistent location
of this data other than in Solr.

Storing the acl info using the connector sounds very interesting. Could be
worth looking at in more detail. Thanks!



> I think we still need to add in the authentication piece to make this all
> work for you, so perhaps you can describe how you expect a user to interact
> with your system, so I can understand your design issues.
>
> Thanks,
> Karl
>
>








> -----Original Message-----
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 11:32 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for your detailed explanation - really good!
>
> As I've thought through some of the implications, I've added comments
> below, so I hope they don't seem too jumbled...
>
> I suppose on the 'authority' side, it works kind of as I envisioned it
> would.
>
> For general Solr access control, there's two layers of security that need
> to be addressed:
>  1. Authentication - make sure the incoming query is from a valid user, and
> the passed-in credentials (hash, certificate etc.) are correct
>  2. Query filtering - potentially reduce the number/type of returned
> results based on the allow/deny metadata for the authenticated user
>
> I can see how the LCF auth connector works for 2., but can it do 1. as
> well?
> It would be good if this could somehow be integrated into any container
> (Tomcat/Jetty et al) authentication that might be configured (probably
> related to your previous post). In many ways, it could/should be that the
> Authority (AD) part of the connector should only be concerned with 1. and
> not 2. (see below).
>
> So, on the repository side, there is also an LCF connector that 'closes the
> loop' to provide the 'what is it I'm trying to control' side of things.
> I understand that LCF doesn't do the mapping - it delegates this task to
> the caller, but provides both sides of the equation (authority &
> repository).
>
> >>>>>
> - Each file in DirectoryA will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890"
> <<<<<
> I think this is the bit that is worrying me - is this storing the SIDs into
> Solr at document index time? This would be a problem for a whole load of
> reasons, but maybe I'm missing something here? (see below for a possible
> alternative)
>
> Basically, what I'm getting at here is that the allow/deny values need to
> be stored in one of three places:
>  1. In the authority (e.g. inside AD)
>  2. In the document metadata (index-time)
>  3. In external storage (e.g. acl.xml, NTFS etc.)
>
> 1. Extending AD is pretty much out, as this causes too many interop
> problems.
> 2. 'Hard-coding' acl information in the index makes it non-portable,
> resistant to changes, etc.
> 3. acl.xml is coupled with a Solr instance, but is easily
> ported/replicated.
> Storing/retrieving acl information from the source (e.g. NTFS) is
> problematic, as the source may not be accessible (it may not even exist).
>
> I believe 3. or a variant is the way to go on the repo side, which means
> the LCF Authority connector is mainly for Authentication (see above), which
> is what you want from AD et al integration.
> The problem that arises from 'pluggable' authentication is that, if you're
> not using a certificate, you have to start with a password, but the
> connector only has access to the password hash (unless the pwd is sent in
> the query url). I don't know of a way to confirm identities in AD using only
> the username and hash (AD does the hash compare). I believe this is where
> container-based integration will likely work better.
>
> So that I can confirm my understanding...a scenario might be like this:
>
> We have an AD connector that fetches the SIDs and we can read them etc.
> For my environment, where there are no 'files' (there's only a transient
> network stream), we have an LCF 'Solr Field Filter Query' connector that
> decides which Filter Queries to apply (allow and deny) for the passed in
> SID(s).
>
> For another environment, let's say, NTFS, there might be an 'NTFS'
> connector that would provide some kind of mapping of files/folders to
> SID(s). Since Solr wouldn't intrinsically know about this, the acl
> information would need to be stored somewhere in the index. This would mean
> extending the Solr schema and storing metadata at index time.
> The alternative is to re-use the 'Solr Field Filter Query' connector for
> this as well (and any other document types that might be read in). This
> keeps the index 'clean' of acl-specific metadata, and allows for in-place
> changes and easy cross-document/index/instance access control.
>
>
> If the above interpretation is [roughly] correct (please let me know if
> I've got this wrong!), this would reduce down to having:
>   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
> (possibly/partly at the container level)
>   2. At least an LCF Repository connector for 'acl.xml'
>   3. Optional other LCF Repository connectors
>
> It sounds like you've now finished the first half of 1. by adding the
> ability to get the required auth data from a Solr api call. The other half
> of 1. will be implementing the LCF interface in the SolrACLSecurity class,
> to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
>
> Does the above sound like an accurate interpretation? Just trying to get a
> good picture of what work needs doing, where it goes, etc.
>
> Many thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:
>
> >  >>>>>>
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?) <<<<<<
> >
> > Documents have access/deny attributes; authorities simply provide the
> > list of tokens that belong to an authenticated user.  Thus, there's no
> > access/deny for an authority; that's attached to the document (as it
> > is in real-world repositories).
> >
> > Let's run a quick example, using Active Directory and a Windows file
> > system.  Suppose that you have a directory with documents in it, call
> > it DirectoryA, and the directory allows read access to the following
> > SIDs:
> >
> > S-123-456-76890
> > S-23-64-12345
> >
> > These SIDs correspond to active directory groups, let's call them
> > Group1 and Group2, respectively.
> >
> > DirectoryB also has documents in it, and those documents have just the
> > SID S-123-456-76890 attached, because only Group1 can read its contents.
> >
> > Now, pretend that someone has created an LCF Active Directory
> > authority connection (in the LCF UI), which is called "myAD", and this
> > connection is set up to talk to the governing AD domain controller for
> > this Windows file system.  We now know enough to describe the document
> > indexing process:
> >
> > - Each file in DirectoryA will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> > and "myAD:S-23-64-12345".
> > - Each file in DirectoryB will have the following
> > __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
> >
> > Now, suppose that a user (let's call him "Peter") is authenticated
> > with the AD domain controller.  Peter belongs to Group2, so his SIDs are
> > (say):
> >
> > S-1-1-0 (the 'everyone' SID)
> > S-323-999-12345 (his own personal user SID)
> > S-23-64-12345 (the SID he gets because he belongs to group 2)
> >
> > We want to look up the documents in the search index that he can see.
> > So, we ask the LCF authority service what his tokens are, and we get
> > back:
> >
> > "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
> >
> > The documents we should return in his search are the ones matching his
> > search criteria that also have at least one ALLOW token in common with
> > his tokens, MINUS any that have a DENY token in common with his tokens
> > (there aren't any DENY tokens involved in this example).  So only
> > files that have one of his three tokens as an ALLOW attribute would be
> > returned.
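The matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LCF or Solr code: the function name `is_visible` and the in-memory document records are hypothetical, and the token values are the ones used in the example in this message.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if a document may be shown to the user.

    A document is visible when the user holds at least one of its ALLOW
    tokens and none of its DENY tokens.
    """
    user = set(user_tokens)
    if user & set(deny_tokens):
        return False
    return bool(user & set(allow_tokens))


# Tokens the authority service returns for the user in the example.
peter = {"myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"}

# ALLOW/DENY tokens attached to documents at index time (hypothetical records).
documents = {
    "DirectoryA/file.txt": {
        "allow": {"myAD:S-123-456-76890", "myAD:S-23-64-12345"},
        "deny": set(),
    },
    "DirectoryB/file.txt": {
        "allow": {"myAD:S-123-456-76890"},
        "deny": set(),
    },
}

visible = sorted(
    name
    for name, acl in documents.items()
    if is_visible(peter, acl["allow"], acl["deny"])
)
print(visible)  # ['DirectoryA/file.txt']
```

Only the DirectoryA file is returned, because the Group2 token is the only one of the user's tokens that appears among any document's ALLOW tokens.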
> >
> > Note that what we are attempting to do is enforce AD's security with
> > the search results we present.  There is no need to define a whole new
> > security mechanism, because AD already has one that people use.
> >
> > >>>>>>
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed data
> > with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or accessible
> > file data).
> > In fact, in my particular case, almost none of the indexed data comes
> > from files.
> > <<<<<<
> >
> > LCF is all about abstracting from repositories.  It's not specifically
> > about a file system, although that is a convenient example.  If you
> > are building your own kind of repository with your own security setup,
> > that's fine - but in the LCF world you'd need to create an authority
> > connector for your repository (which maybe reads your acl.xml file),
> > as well as a repository connector (which hands documents to LCF and
> > provides it with the access tokens that make security work).  Of
> > course, you can do something much lighter that doesn't include LCF at all
> > if you are just integrating a custom repository of your own, but it
> > sounded like you were interested in the broader problem here.
> >
> > So, LCF doesn't do "acl mapping" at all.  It relies on its various
> > connectors to work cooperatively to define access tokens in a way that
> > is consistent from authority connector to repository connector for a
> > given repository kind.  Anybody can write a connector, so the beauty
> > of all this is that you can build a system where data from many
> > disparate sources is indexed, and security for each is simultaneously
> > enforced.
> >
> > Karl
> >
> >
> >  ------------------------------
> > *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> > *Sent:* Thursday, April 22, 2010 9:24 AM
> >
> > *To:* dev@lucene.apache.org
> > *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> > connectors-dev@incubator.apache.org
> > *Subject:* Re: FW: Solr and LCF security at query time
> >
> > Hi Karl,
> >
> > Thanks very much for the diagram -
> > Sorry about all the questions, but this raises a few new ones...
> >
> > What is the relationship between stored data (documents) and authorities'
> > access/deny attributes? (do you have any examples of what an
> > access_token value might contain?)
> >
> > One of the key requirements I've worked to adhere to in SOLR-1872 is
> > to ensure there are no security or other dependencies of indexed data
> > with any external repository - most notably the file system.
> > There are many reasons for wanting this, but one of the main ones is
> > that Solr-stored data is not always based on file data (or accessible
> > file data).
> > In fact, in my particular case, almost none of the indexed data comes
> > from files.
> >
> > This is one reason why SOLR-1872 uses filter queries for its
> > access/deny tokens - so that all the required information for access
> > control completely resides within the Solr index itself.
> > Is the LCF architecture acl 'mapping' between Solr fields (queries)
> > and users, some external 'repository' (files) and users, or arbitrary
> > data (e.g. either of these)?
> >
> > I hope that makes sense...
> >
> > Thanks!
> > Peter
> >
> >
> >
> >
> > On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
> >
> >> Hi Peter,
> >>
> >> I've attached a diagram that is not in the wiki as of yet, and I'll
> >> try to answer your questions.
> >>
> >> >>>>>>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored
> >> for a particular user in the underlying acl store (e.g. Active
> >> Directory)?
> >> How does AD and/or LCF handle storing such data in its schema? (does
> >> AD need its schema extended?) Presumably, any such AD fields would
> >> need to be queried for effective rights in order to cater for group
> >> membership allows and denies.
> >> <<<<<<
> >>
> >> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
> >> strings that represent a contract between an LCF authority connection
> >> and the LCF repository connection that picks up the documents (from
> >> wherever).  These tokens thus have no real meaning outside of LCF.  You
> >> must regard them as opaque.
> >>
> >> The contract, however, states that if you use the LCF authority
> >> service to obtain tokens for an authenticated user, you will get back
> >> a set that is CONSISTENT with the tokens that were attached to the
> >> documents LCF sent to Solr for indexing in the first place.  So, you
> >> don't have to worry about it, and that's kind of the idea.  So you can
> >> imagine the following flow:
> >>
> >> (1) Use LCF to fetch documents and send them to Solr
> >> (2) When searching, use the LCF authority service to get the desired
> >> user's access tokens
> >> (3) Either filter the results, or modify the query, to be sure the
> >> access tokens all match up properly
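The "modify the query" variant of step (3) could be sketched as building a Solr filter-query string from the user's tokens. This is a minimal sketch under assumptions: the helper names are hypothetical, the field names `__ALLOW_TOKEN__document` and `__DENY_TOKEN__document` follow the example elsewhere in this thread (the thread also mentions `__ACCESS_TOKEN__document`, so check what your output connector actually posts), and the fields are assumed to be indexed as untokenized, multi-valued strings.

```python
def quote_token(token):
    """Quote a token as a query phrase, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_security_filter(user_tokens,
                          allow_field="__ALLOW_TOKEN__document",
                          deny_field="__DENY_TOKEN__document"):
    """Build a filter query: require any ALLOW match, prohibit any DENY match."""
    tokens = sorted(set(user_tokens))
    if not tokens:
        # A user with no tokens should see nothing; let the caller decide how.
        raise ValueError("user has no access tokens")
    clause = " OR ".join(quote_token(t) for t in tokens)
    return "+{0}:({2}) -{1}:({2})".format(allow_field, deny_field, clause)


fq = build_security_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"])
print(fq)
```

The resulting string would be sent as an additional `fq` parameter alongside the user's own query, so the restriction is applied by the search engine rather than by post-filtering the results.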
> >>
> >> For the AD authority, the LCF access tokens consist, in part, of the
> >> user's SIDs.  For other authorities, the access tokens are wildly
> >> different.  You really don't want to know what's in them, since that's
> >> the job of the LCF authority to determine. ;-)
> >>
> >> LCF is not, by the way, joined at the hip with AD.  However, in
> >> practice, most enterprises in the world use some form of AD single
> >> signon for their web applications, and even if they're using some
> >> repository with its own idea of security, there's a mapping between
> >> the AD users and the repository's users.  Doing that mapping is also
> >> the job of the LCF authority for that repository.
> >>
> >> Hope this helps.  Also, I'm not expecting time miracles here, so
> >> don't sweat the schedule.
> >>
> >>
> >> Karl
> >>
> >>
> >> ________________________________________
> >> From: ext Peter Sturge [peter.sturge@googlemail.com]
> >> Sent: Thursday, April 22, 2010 4:27 AM
> >> To: dev@lucene.apache.org
> >> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> >> connectors-dev@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the quick turnaround.
> >> I'm in the middle of a product release for us, so I fear I won't be
> >> as quick as you... :-)
> >>
> >> I couldn't find a simple flow diagram or similar for LCF with regards
> >> security (probably looking in the wrong place).
> >> Perhaps you could help on these questions...?
> >>
> >> In SOLR-1872, the allows and denies are stored (in acl.xml) as
> >> sub-queries, which are then used as filter queries in a user's search.
> >>
> >> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored
> >> for a particular user in the underlying acl store (e.g. Active
> >> Directory)?
> >> How does AD and/or LCF handle storing such data in its schema? (does
> >> AD need its schema extended?) Presumably, any such AD fields would
> >> need to be queried for effective rights in order to cater for group
> >> membership allows and denies.
> >>
> >> I guess I'm just trying to understand the architectural
> >> flow/storage/retrieval of data in the various parts of the system,
> >> but I admit, I need to do more research on this.
> >> After our product release, when I get a few more spare cycles, I can
> >> look at it in more detail.
> >>
> >> Many thanks!
> >> Peter
> >>
> >>
> >>
> >> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I just committed the promised changes to the LCF Solr output connector.
> >>
> >> ACL metadata will now be posted to the Solr Http interface along with
> >> the document as the two following fields:
> >>
> >> __ACCESS_TOKEN__document
> >> __DENY_TOKEN__document
> >>
> >> There will, of course, potentially be multiple values for each of
> >> these two fields.
> >>
> >> Hope this helps,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 6:51 PM
> >>
> >> To: connectors-user@incubator.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Thanks for the info. I'll have a look at the link and try to take in
> >> as much sugar as my insulin levels will handle...
> >> It sounds like the necessary interface(s) are already in LCF - just a
> >> matter of implementing them in the Solr 1872 plugin.
> >> I'll need to digest the LCF stuff to get to grips with it..please
> >> bear with me while I do that...
> >>
> >> When you say:
> >>   The LCF solr output connection doesn't yet do this, but it is
> >> trivial for me to make that happen.
> >> Do you mean a mechanism by which solr.war can get url et al info from
> >> its parent container (Tomcat, Jetty etc.), or have I misinterpreted
> >> this?
> >>
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> >> Hi Peter,
> >>
> >> I'm the principal committer for LCF, but I don't know as much about
> >> Solr as I ought to, so it sounds like a potentially productive
> >> collaboration.
> >>
> >> LCF does exactly what you are looking for - the only issue at all is
> >> that you need to fetch a URL from a webapp to get what you are
> >> looking for.  The "plugs" are all inside LCF for different kinds of
> >> repositories.  Here's a link that might help with drinking the LCF
> >> "koolaid", as it were:
> >> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
> >>
> >> The url would be something like this (on a locally installed
> >> tomcat-based LCF instance):
> >>
> >>
> >> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
> >>
> >> ... and this fetch returns something like:
> >>
> >> TOKEN:xxxxxxx
> >> TOKEN:yyyyyyy
> >> TOKEN:zzzzzzz
> >> ....
> >>
> >> ... which represent the amalgamated tokens for all of the defined
> >> authorities, and by some strange coincidence ( ;-) ) are compatible
> >> with certain pieces of metadata that have been passed into Solr with
> >> each document - one set of Allow tokens, and a second set of Deny
> >> tokens.  The LCF solr output connection doesn't yet do this, but it
> >> is trivial for me to make that happen.
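A client of the authority service might build the URL and parse the reply roughly as follows. This is a hedged sketch: the function names are hypothetical, the base URL is the local Tomcat example given above, and the response is assumed to be plain text with one `TOKEN:` line per access token, as in the sample output shown. Any other lines are simply ignored here, and no HTTP call is made.

```python
from urllib.parse import urlencode

# Base URL taken from the example above (a locally installed LCF instance).
AUTHORITY_BASE = "http://localhost:8080/lcf-authority-service/UserACLs"


def authority_url(username, base=AUTHORITY_BASE):
    """Build the lookup URL, percent-encoding the user name."""
    return base + "?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract access tokens from a response made of 'TOKEN:<value>' lines."""
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


sample = "TOKEN:myAD:S-1-1-0\nTOKEN:myAD:S-323-999-12345\nTOKEN:myAD:S-23-64-12345\n"
print(authority_url("someusername@somedomain.com"))
print(parse_tokens(sample))
```

Only the `TOKEN:` prefix is stripped, so a token value that itself contains colons (such as an authority-qualified SID) is kept whole.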
> >>
> >> Does this sound plausible to you?
> >>
> >> Karl
> >>
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 5:41 PM
> >> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> >>
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> Hi Karl,
> >>
> >> Integrating LCF to get external token support for SOLR-1872 sounds
> >> very interesting indeed. I don't know anything about LCF, but one of
> >> the things I was planning for SOLR-1872 is to make acl.xml (or rather
> >> its behaviour) 'pluggable' - i.e. it would just be one of a series of
> >> plugins that could be used for obtaining back-end authentication
> >> information.
> >>
> >> If you're good with LCF, perhaps we could work together to build this
> >> in.
> >> One of the first things would be defining an interface that would be
> >> as easy as possible to plug LCF into. Have you any
> >> suggestions/insight on this front?
> >>
> >> Many thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> >> SOLR-1872 looks exactly like what I was envisioning, from the search
> >> query perspective, although instead of the acl xml file you specify,
> >> LCF stipulates that you would dynamically query the lcf-authority-service
> >> servlet for the access tokens themselves.  That would get you support
> >> for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems
> >> likely that this component could be modified to work with LCF with minor
> >> effort.
> >>
> >> The missing component still seems to be AD authentication, which
> >> needs a solution.
> >>
> >> Karl
> >>
> >> ________________________________
> >> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> >> Sent: Tuesday, April 20, 2010 10:44 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: FW: Solr and LCF security at query time
> >>
> >> If you want to do this completely within Solr, have a look at:
> >> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
> >>
> >> Thanks,
> >> Peter
> >>
> >>
> >>
> >> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
> >> karl.wright@nokia.com>> wrote:
> >> FYI
> >>
> >> ________________________________
> >> From: Wright Karl (Nokia-S/Cambridge)
> >> Sent: Tuesday, April 20, 2010 8:16 AM
> >> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
> >> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
> >> connectors-dev@incubator.apache.org<mailto:
> >> connectors-dev@incubator.apache.org>'; '
> >> connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>'
> >> Subject: RE: Solr and LCF security at query time
> >>
> >> Dominique,
> >>
> >> Yes, I am aware of this ticket and contribution.  Luckily LCF
> >> establishes a powerful multi-repository security model, even though
> >> it doesn't yet do the final step of enforcing that model at the
> >> search end.  LCF allows you to define multiple authorities to operate
> >> against disparate repositories, and use the appropriate authority to
> >> secure any given document.  The solr people are aware of this design,
> >> which addresses the issues raised by SOLR-1834 very nicely.  However,
> >> as I said before, time is a problem, and the work still needs to be
> done.
> >>
> >> I suggest you read up on the actual security model of LCF, and
> >> perhaps experiment with that and the SOLR-1834 contribution, to see
> >> if there is common ground.  One thing we've learned at MetaCarta is
> >> that post-filtering for security purposes is expensive, and it is
> >> better to modify the queries themselves to restrict the results, if
> >> possible.  I'm not sure which approach SOLR-1834 takes, although it
> >> sounds like it might be the filtering approach.  Still, it would be
> better than nothing.
> >>
> >> Please let me know what you find out.
> >>
> >> Thanks,
> >> Karl
> >>
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
> >> dominique.bejean@eolya.fr>]
> >> Sent: Tuesday, April 20, 2010 8:03 AM
> >> To: Wright Karl (Nokia-S/Cambridge)
> >> Cc: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>;
> >> connectors-dev@incubator.apache.org<mailto:
> >> connectors-dev@incubator.apache.org>
> >> Subject: Re: Solr and LCF security at query time
> >>
> >> Karl,
> >>
> >> Thank you for your reply.
> >>
> >> I made some research today and I found this :
> >> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-183
> >> 4 http://demo.findwise.se:8880/SolrSecurity/
> >>
> >> Sorl security model have to be able to filter result list with items
> >> coming from various sources at the same time (livelink, documentum,
> >> file system, ...). Big subject :)
> >>
> >> Dominique
> >>
> >>
> >> Le 20/04/10 13:34,
> >> karl.wright@nokia.com<ma...@nokia.com> a écrit :
> >> Hi Dominique,
> >>
> >> At the moment, in order to enforce the LCF security model within
> >> Lucene/Solr, you will need to build this functionality into whatever
> >> client you are using to display the Lucene search results.
> >> Specifically, you would need to take the following steps:
> >>
> >> (1) Have your users access your search client through Apache.
> >> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> >> mod_authz_annotate, to cause authorization HTTP headers to be
> >> transmitted to the client webapp.
> >> (3) Have your client webapp alter whatever queries it is doing, to
> >> add an appropriate query clause for each of the access tokens
> >> transmitted in the headers.
> >>
> >> (This is how it is done at MetaCarta.)
> >>
> >> Alternatively, you may find a way to do this completely with a web
> >> application under a Java app server such as Tomcat.  I have not yet
> >> done the research to find out whether this is a feasible alternative.
> >> Effectively, what you need something like mod_auth_kerb to do is to
> >> authenticate your user against Active Directory, or whomever the
> authenticator ought to be.
> >>  JAAS may be helpful here.
> >>
> >> There are, of course, intentions to fill out the missing pieces more
> >> completely and transparently via a Solr search plugin and/or filter.
> >> What has been lacking is time.  If you are in a position to do
> >> development in this area, we're happy to have any assistance you
> >> might provide.
> >>
> >> Thanks,
> >> Karl
> >> ________________________________
> >> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> >> Sent: Tuesday, April 20, 2010 5:06 AM
> >> To: connectors-user@incubator.apache.org<mailto:
> >> connectors-user@incubator.apache.org>
> >>  Subject: Solr and LCF security at query time
> >>
> >> Hi,
> >>
> >> I don't see in the LCF wiki how Solr and LCF work together at query
> >> time in order to remove from the result list the items the user is
> >> not allowed to access.
> >>
> >> In
> >> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> >> I just see these sentences:
> >>
> >> " Once all these documents and their access tokens are handed to the
> >> search engine, it is the search engine's job to enforce security by
> >> excluding inappropriate documents from the search results. For
> >> Lucene, this infrastructure is expected to be built on top of
> >> Lucene's generic metadata abilities, but has not been implemented at
> >> this time."
> >>
> >> I am not sure I understand. Does this mean that, for the moment, it
> >> is not possible for Solr to apply security by using an Authority
> >> Connector?
> >>
> >> Dominique
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
> >> additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

>>>>>>
For general Solr access control, there's two layers of security that need to be addressed:
  1. Authentication - make sure the incoming query is from a valid user, and the passed-in credentials (hash, certificate etc.) are correct
  2. Query filtering - potentially reduce the number/type of returned results based on the allow/deny metadata for the authenticated user

I can see how the LCF auth connector works for 2., but can it do 1. as well?
<<<<<<

The authority connectors don't perform authentication at this time.  In fact, LCF has nothing to do with authentication at all - just authorization.  The reason is that it is almost never the case that somebody wants to provide multiple credentials in order to be able to see their results.  Most enterprises that have multiple repositories authenticate against AD and then map AD user names to repository user names in order to access those repositories.  If you saw my earlier posts from this morning, you may have noticed that I'm looking at recommending JAAS plus Sun's Kerberos 5 (Krb5) login module for handling the "authenticate against AD" case, which would cover some 95%+ of the real-world authentication needed out there.


>>>>>>
I think this is the bit that is worrying me - is this storing the SIDs into Solr at document index time? This would be a problem for a whole load of reasons, but maybe I'm missing something here? (see below for a possible
alternative)
<<<<<<

Yes, the idea is to store SIDs in Solr at index time.  I don't know enough about Solr to know what kinds of issues this might entail, but Lucene certainly has a pretty flexible model of metadata, so I don't think this would be difficult at all.  Erik Hatcher also seemed to confirm my suspicion that this would not be a problem.
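
As a rough sketch of what index-time token storage could look like, the snippet below attaches ALLOW and DENY tokens to each document and applies the visibility rule from the DirectoryA / DirectoryB walk-through quoted further down. The field name __ALLOW_TOKEN__document follows that walk-through (the commit announcement in this thread calls it __ACCESS_TOKEN__document, so treat the exact name as an assumption). The "id" field and the helper names build_solr_doc and is_visible are made up for illustration and are not part of LCF or Solr.

```python
# Field names as quoted in this thread; treat them as assumptions.
ALLOW_FIELD = "__ALLOW_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def build_solr_doc(doc_id, allow_tokens, deny_tokens=()):
    """Return a dict shaped like a Solr document with multi-valued ACL fields."""
    return {
        "id": doc_id,  # hypothetical unique-key field
        ALLOW_FIELD: sorted(allow_tokens),
        DENY_FIELD: sorted(deny_tokens),
    }


def is_visible(doc, user_tokens):
    """Visible if the user shares an ALLOW token and shares no DENY token."""
    tokens = set(user_tokens)
    allowed = bool(tokens & set(doc[ALLOW_FIELD]))
    denied = bool(tokens & set(doc[DENY_FIELD]))
    return allowed and not denied


# Example data taken from the DirectoryA / DirectoryB walk-through.
doc_a = build_solr_doc(
    "DirectoryA/file1", ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
)
doc_b = build_solr_doc("DirectoryB/file1", ["myAD:S-123-456-76890"])
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]

visible_ids = [d["id"] for d in (doc_a, doc_b) if is_visible(d, peter)]
print(visible_ids)  # prints ['DirectoryA/file1']
```

Peter belongs to Group2 only, so he matches the second ALLOW token on the DirectoryA file and nothing on the DirectoryB file.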

>>>>>>
The problem that arises from 'pluggable' authentication is that, if you're not using a certificate, you have to start with a password, but the connector only has access to the password hash (unless the pwd is sent in the query url). I don't know of a way to confirm identities in AD using only the username and hash (AD does the hash compare). I believe this is where container-based integration will likely work better.
<<<<<<

This is exactly why I think that we need to do the authentication upstream of the authority world.
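
To make the split concrete: once something upstream has authenticated the user, the search side only needs the user name to ask the authority service for tokens. The sketch below builds the lookup URL in the shape given earlier in this thread and parses a "TOKEN:" response. The sample response body is an assumption (the thread only shows placeholder values), and the function names are hypothetical. No network call is made here.

```python
from urllib.parse import urlencode


def authority_url(base_url, username):
    """Build the UserACLs lookup URL for an already-authenticated user."""
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(response_text):
    """Extract token values from lines of the form 'TOKEN:<value>'."""
    tokens = []
    for line in response_text.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens  # lines that do not start with TOKEN: are ignored


url = authority_url(
    "http://localhost:8080/lcf-authority-service", "someusername@somedomain.com"
)

# Assumed sample body; real token values are opaque and authority-specific.
sample = "TOKEN:myAD:S-1-1-0\nTOKEN:myAD:S-323-999-12345\nTOKEN:myAD:S-23-64-12345\n"
tokens = parse_tokens(sample)
```

The user name passed in must come from the upstream authentication step (for example the container or Apache), never from an unauthenticated query parameter.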

>>>>>>
For another environment, let's say, NTFS, there might be an 'NTFS' connector that would provide some kind of mapping of files/folders to SID(s). Since Solr wouldn't intrinsically know about this, the acl information would need to be stored somewhere in the index. This would mean extending the Solr schema and storing metadata at index time.
<<<<<<

If Solr handles arbitrary document metadata, then I think we could just use that feature.  But you know more about it than me, at this point.  It would be great to get an overview of potential ways of doing this.
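
One potential way, sketched here under assumptions: if the tokens are stored as untokenized, multi-valued string fields, the search side could restrict results by adding a clause built from the user's tokens, rather than post-filtering. The field names are the ones quoted in this thread; quote_term and build_acl_filter are hypothetical helpers, and the clause uses ordinary Lucene query syntax.

```python
ALLOW_FIELD = "__ALLOW_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def quote_term(token):
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_acl_filter(user_tokens):
    """Clause that requires an ALLOW token match and forbids a DENY token match."""
    if not user_tokens:
        raise ValueError("at least one access token is required")
    terms = " OR ".join(quote_term(t) for t in user_tokens)
    return "+{0}:({1}) -{2}:({1})".format(ALLOW_FIELD, terms, DENY_FIELD)


fq = build_acl_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"])
print(fq)
```

Quoting each token keeps characters such as ':' and '-' from being read as query syntax. Note that a clause of this shape hides documents that carry no ALLOW tokens at all, so unsecured documents would need separate handling.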

>>>>>>
If the above interpretation is [roughly] correct (please let me know if I've got this wrong!), this would reduce down to having:
   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.) (possibly/partly at the container level)
   2. At least an LCF Repository connector for 'acl.xml'
   3. Optional other LCF Repository connectors
<<<<<<

For your particular task, it sounds like you are trying to read from NTFS and apply security after-the-fact with some acl specification file.  In that case, I'd write a repository connector that was based on the file system connector (already part of the stable of connectors for LCF) which reads ACL information from your acl.xml file.  Or, if you prefer a UI for specifying ACL information, you could extend the connector so that security is configured in the UI without having an external acl.xml file at all - which would be a nice addition to the existing file system connector.  (Repository connections and jobs are configured internally in LCF by XML documents stored in the database, so they can be arbitrarily structured.  I'm happy to help you figure out how to do this if this is what you decide to do.)

I think we still need to add in the authentication piece to make this all work for you, so perhaps you can describe how you expect a user to interact with your system, so I can understand your design issues.

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 11:32 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks very much for your detailed explanation - really good!

As I've thought through some of the implications, I've added comments below, so I hope they don't seem too jumbled...

I suppose on the 'authority' side, it works kind of as I envisioned it would.

For general Solr access control, there's two layers of security that need to be addressed:
  1. Authentication - make sure the incoming query is from a valid user, and the passed-in credentials (hash, certificate etc.) are correct
  2. Query filtering - potentially reduce the number/type of returned results based on the allow/deny metadata for the authenticated user

I can see how the LCF auth connector works for 2., but can it do 1. as well?
It would be good if this could somehow be integrated into any container (Tomcat/Jetty et al) authentication that might be configured (probably related to your previous post). In many ways, it could/should be that the Authority (AD) part of the connector should only be concerned with 1. and not 2. (see below).

So, on the repository side, there is also an LCF connector that 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
I understand that LCF doesn't do the mapping - it delegates this task to the caller, but provides both sides of the equation (authority & repository).

>>>>>
- Each file in DirectoryA will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
- Each file in DirectoryB will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
<<<<<
I think this is the bit that is worrying me - is this storing the SIDs into Solr at document index time? This would be a problem for a whole load of reasons, but maybe I'm missing something here? (see below for a possible
alternative)

Basically, what I'm getting at here is that the allow/deny values need to be stored in one of three places:
  1. In the authority (e.g. inside AD)
  2. In the document metadata (index-time)
  3. In external storage (e.g. acl.xml, NTFS etc.)

1. Extending AD is pretty much out, as this causes too many interop problems.
2. 'Hard-coding' acl information in the index makes it non-portable, resistant to changes, etc.
3. acl.xml is coupled with a Solr instance, but is easily ported/replicated.
Storing/retrieving acl information from the source (e.g. NTFS) is problematic, as the source may not be accessible (it may not even exist).

I believe 3. or a variant is the way to go on the repo side, which means the LCF Authority connector is mainly for Authentication (see above), which is what you want from AD et al integration.
The problem that arises from 'pluggable' authentication is that, if you're not using a certificate, you have to start with a password, but the connector only has access to the password hash (unless the pwd is sent in the query url). I don't know of a way to confirm identities in AD using only the username and hash (AD does the hash compare). I believe this is where container-based integration will likely work better.

So that I can confirm my understanding...a scenario might be like this:

We have an AD connector that fetches the SIDs and we can read them etc.
For my environment, where there are no 'files' (there's only a transient network stream), we have an LCF 'Solr Field Filter Query' connector that decides which Filter Queries to apply (allow and deny) for the passed in SID(s).

For another environment, let's say, NTFS, there might be an 'NTFS' connector that would provide some kind of mapping of files/folders to SID(s). Since Solr wouldn't intrinsically know about this, the acl information would need to be stored somewhere in the index. This would mean extending the Solr schema and storing metadata at index time.
The alternative is to re-use the 'Solr Field Filter Query' connector for this as well (and any other document types that might be read in). This keeps the index 'clean' of acl-specific metadata, and allows for in-place changes and easy cross-document/index/instance access control.
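
A minimal sketch of that alternative, assuming an external store that maps each token (SID) to allow and deny sub-queries, so that nothing acl-specific lives in the index. The ACL dict below is a hypothetical in-memory stand-in and is not the actual SOLR-1872 acl.xml format; the field names source and level and the function filter_queries_for are invented for illustration.

```python
# Hypothetical stand-in for an external ACL store such as acl.xml:
# each token maps to the sub-queries it allows and the ones it denies.
ACL = {
    "myAD:S-23-64-12345": {"allow": ["source:firewall"], "deny": ["level:secret"]},
    "myAD:S-1-1-0": {"allow": ["source:public"], "deny": []},
}


def filter_queries_for(tokens, acl=ACL):
    """Combine the allow/deny sub-queries of all known tokens into one filter."""
    allow, deny = [], []
    for token in tokens:
        entry = acl.get(token)
        if entry is None:
            continue  # token unknown to this store
        for query in entry["allow"]:
            if query not in allow:
                allow.append(query)
        for query in entry["deny"]:
            if query not in deny:
                deny.append(query)
    if not allow:
        return None  # nothing is allowed, so the caller should return no results
    combined = "(" + " OR ".join("(" + q + ")" for q in allow) + ")"
    for query in deny:
        combined += " AND NOT (" + query + ")"
    return combined


fq = filter_queries_for(
    ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]
)
print(fq)  # prints ((source:public) OR (source:firewall)) AND NOT (level:secret)
```

Changing who can see what then means editing the external store only, with no re-indexing, which is the portability argument made above.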


If the above interpretation is [roughly] correct (please let me know if I've got this wrong!), this would reduce down to having:
   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.) (possibly/partly at the container level)
   2. At least an LCF Repository connector for 'acl.xml'
   3. Optional other LCF Repository connectors

It sounds like you've now finished the first half of 1. by adding the ability to get the required auth data from a Solr api call. The other half of 1. will be implementing the LCF interface in the SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.

Does the above sound like an accurate interpretation? Just trying to get a good picture of what work needs doing, where it goes, etc.

Many thanks!
Peter




On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:

>  >>>>>>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an
> access_token value might contain?) <<<<<<
>
> Documents have access/deny attributes; authorities simply provide the
> list of tokens that belong to an authenticated user.  Thus, there's no
> access/deny for an authority; that's attached to the document (as it
> is in real-world repositories).
>
> Let's run a quick example, using Active Directory and a Windows file
> system.  Suppose that you have a directory with documents in it, call
> it DirectoryA, and the directory allows read access to the following SIDs:
>
> S-123-456-76890
> S-23-64-12345
>
> These SIDs correspond to active directory groups, let's call them
> Group1 and Group2, respectively.
>
> DirectoryB also has documents in it, and those documents have just the
> SID S-123-456-76890 attached, because only Group1 can read its contents.
>
> Now, pretend that someone has created an LCF Active Directory
> authority connection (in the LCF UI), which is called "myAD", and this
> connection is set up to talk to the governing AD domain controller for
> this Windows file system.  We now know enough to describe the document indexing process:
>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
>
> Now, suppose that a user (let's call him "Peter") is authenticated
> with the AD domain controller.  Peter belongs to Group2, so his SIDs are (say):
>
> S-1-1-0 (the 'everyone' SID)
> S-323-999-12345 (his own personal user SID)
> S-23-64-12345 (the SID he gets because he belongs to group 2)
>
> We want to look up the documents in the search index that he can see.
> So, we ask the LCF authority service what his tokens are, and we get back:
>
> "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
>
> The documents we should return in his search are the ones that match
> his search criteria AND have at least one ALLOW token in common with
> his tokens, EXCLUDING any that have a DENY token in common with his
> tokens (there aren't any DENY tokens involved in this example).  So
> only files that have one of his three tokens as an ALLOW attribute
> would be returned.
>
> Note that what we are attempting to do is enforce AD's security with
> the search results we present.  There is no need to define a whole new
> security mechanism, because AD already has one that people use.
>
> >>>>>>
> One of the key requirements I've worked to adhere to in SOLR-1872 is
> to ensure there are no security or other dependencies of indexed data
> with any external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is
> that Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes
> from files.
> <<<<<<
>
> LCF is all about abstracting from repositories.  It's not specifically
> about a file system, although that is a convenient example.  If you
> are building your own kind of repository with your own security setup,
> that's fine - but in the LCF world you'd need to create an authority
> connector for your repository (which maybe reads your acl.xml file),
> as well as a repository connector (which hands documents to LCF and
> provides it with the access tokens that make security work).  Of
> course, you can build something much lighter that doesn't include LCF at all
> if you are just integrating a custom repository of your own, but it
> sounded like you were interested in the broader problem here.
>
> So, LCF doesn't do "acl mapping" at all.  It relies on its various
> connectors to work cooperatively to define access tokens in a way that
> is consistent from authority connector to repository connector for a
> given repository kind.  Anybody can write a connector, so the beauty
> of all this is that you can build a system where data from many
> disparate sources is indexed, and security for each is simultaneously enforced.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Thursday, April 22, 2010 9:24 AM
>
> *To:* dev@lucene.apache.org
> *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for the diagram -
> Sorry about all the questions, but this raises a few new ones...
>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an
> access_token value might contain?)
>
> One of the key requirements I've worked to adhere to in SOLR-1872 is
> to ensure there are no security or other dependencies of indexed data
> with any external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is
> that Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes
> from files.
>
> This is one reason why SOLR-1872 uses filter queries for its
> access/deny tokens - so that all the required information for access
> control completely resides within the Solr index itself.
> Is the LCF architecture acl 'mapping' between Solr fields (queries)
> and users, some external 'repository' (files) and users, or arbitrary data (e.g.
> either of these)?
>
> I hope that makes sense...
>
> Thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
>
>> Hi Peter,
>>
>> I've attached a diagram that is not in the wiki as of yet, and I'll
>> try to answer your questions.
>>
>> >>>>>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored
>> for a particular user in the underlying acl store (e.g. Active Directory)?
>> How does AD and/or LCF handle storing such data in its schema? (does
>> AD needs its schema extended?) Presumably, any such AD fields would
>> need to be queried for effective rights in order to cater for group
>> membership allows and denies.
>> <<<<<<
>>
>> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
>> strings that represent a contract between an LCF authority connection
>> and the LCF repository connection that picks up the documents (from wherever).
>>  These tokens thus have no real meaning outside of LCF.  You must
>> regard them as opaque.
>>
>> The contract, however, states that if you use the LCF authority
>> service to obtain tokens for an authenticated user, you will get back
>> a set that is CONSISTENT with the tokens that were attached to the
>> documents LCF sent to Solr for indexing in the first place.  So, you
>> don't have to worry about it, and that's kind of the idea.  So you imagine the following flow:
>>
>> (1) Use LCF to fetch documents and send them to Solr
>> (2) When searching, use the LCF authority service to get the desired
>> user's access tokens
>> (3) Either filter the results, or modify the query, to be sure the
>> access tokens all match up properly
>>
>> For the AD authority, the LCF access tokens consist, in part, of the
>> user's SIDs.  For other authorities, the access tokens are wildly different.
>>  You really don't want to know what's in them, since that's the job
>> of the LCF authority to determine. ;-)
>>
>> LCF is not, by the way, joined at the hip with AD.  However, in
>> practice, most enterprises in the world use some form of AD single
>> signon for their web applications, and even if they're using some
>> repository with its own idea of security, there's a mapping between
>> the AD users and the repository's users.  Doing that mapping is also
>> the job of the LCF authority for that repository.
>>
>> Hope this helps.  Also, I'm not expecting time miracles here, so
>> don't sweat the schedule.
>>
>>
>> Karl
>>
>>
>> ________________________________________
>> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> Sent: Thursday, April 22, 2010 4:27 AM
>> To: dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> connectors-dev@incubator.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the quick turnaround.
>> I'm in the middle of a product release for us, so I fear I won't be
>> as quick as you... :-)
>>
>> I couldn't find a simple flow diagram or similar for LCF with regards
>> security (probably looking in the wrong place).
>> Perhaps you could help on these questions...?
>>
>> In SOLR-1872, the allows and denies are stored (in acl.xml) as
>> sub-queries, which are then used as filter queries in a user's search.
>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored
>> for a particular user in the underlying acl store (e.g. Active Directory)?
>> How does AD and/or LCF handle storing such data in its schema? (does
>> AD needs its schema extended?) Presumably, any such AD fields would
>> need to be queried for effective rights in order to cater for group
>> membership allows and denies.
>>
>> I guess I'm just trying to understand the architectural
>> flow/storage/retrieval of data in the various parts of the system,
>> but I admit, I need to do more research on this.
>> After our product release, when I get a few more spare cycles, I can
>> look at it in more detail.
>>
>> Many thanks!
>> Peter
>>
>>
>>
>> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I just committed the promised changes to the LCF Solr output connector.
>>
>> ACL metadata will now be posted to the Solr Http interface along with
>> the document as the two following fields:
>>
>> __ACCESS_TOKEN__document
>> __DENY_TOKEN__document
>>
>> There will, of course, potentially be multiple values for each of
>> these two fields.
>>
>> Hope this helps,
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 6:51 PM
>>
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the info. I'll have a look at the link and try to take in
>> as much sugar as my insulin levels will handle...
>> It sounds like the necessary interface(s) are already in LCF - just a
>> matter of implementing them in the Solr 1872 plugin.
>> I'll need to digest the LCF stuff to get to grips with it..please
>> bear with me while I do that...
>>
>> When you say:
>>   The LCF solr output connection doesn't yet do this, but it is
>> trivial for me to make that happen.
>> Do you mean a mechanism by which solr.war can get url et al info from
>> its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>>
>>
>> Thanks,
>> Peter
>>
>>
>>
>>
>> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I'm the principal committer for LCF, but I don't know as much about
>> Solr as I ought to, so it sounds like a potentially productive collaboration.
>>
>> LCF does exactly what you are looking for - the only issue at all is
>> that you need to fetch a URL from a webapp to get what you are
>> looking for.  The "plugs" are all inside LCF for different kinds of
>> repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>>
>> The url would be something like this (on a locally installed
>> tomcat-based LCF instance):
>>
>>
>> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>>
>> ... and this fetch returns something like:
>>
>> TOKEN:xxxxxxx
>> TOKEN:yyyyyyy
>> TOKEN:zzzzzzz
>> ....
>>
>> ... which represent the amalgamated tokens for all of the defined
>> authorities, and by some strange coincidence ( ;-) ) are compatible
>> with certain pieces of metadata that have been passed into Solr with
>> each document - one set of Allow tokens, and a second set of Deny
>> tokens.  The LCF solr output connection doesn't yet do this, but it
>> is trivial for me to make that happen.
>>
>> Does this sound plausible to you?
>>
>> Karl
>>
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 5:41 PM
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
>> dev@lucene.apache.org>
>>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Integrating LCF to get external token support for SOLR-1872 sounds
>> very interesting indeed. I don't know anything about LCF, but one of
>> the things I was planning for SOLR-1872 is to make acl.xml (or rather
>> its behaviour) 'pluggable' - i.e. it would just be one of a series of
>> plugins that could be used for obtaining back-end authentication information.
>>
>> If you're good with LCF, perhaps we could work together to build this in.
>> One of the first things would be defining an interface that would be
>> as easy as possible to plug LCF into. Have you any
>> suggestions/insight on this front?
>>
>> Many thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> SOLR-1872 looks exactly like what I was envisioning, from the search
>> query perspective, although instead of the acl xml file you specify
>> LCF stipulates you would dynamically query the lcf-authority-service
>> servlet for the access tokens themselves.  That would get you support
>> for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems
>> likely that this component could be modified to work with LCF with minor effort.
>>
>> The missing component still seems to be AD authentication, which
>> needs a solution.
>>
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 10:44 AM
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> If you want to do this completely within Solr, have a look at:
>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>>
>> Thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> FYI
>>
>> ________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 20, 2010 8:16 AM
>> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
>> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
>> connectors-dev@incubator.apache.org<mailto:
>> connectors-dev@incubator.apache.org>'; '
>> connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>'
>> Subject: RE: Solr and LCF security at query time
>>
>> Dominique,
>>
>> Yes, I am aware of this ticket and contribution.  Luckily LCF
>> establishes a powerful multi-repository security model, even though
>> it doesn't yet do the final step of enforcing that model at the
>> search end.  LCF allows you to define multiple authorities to operate
>> against disparate repositories, and use the appropriate authority to
>> secure any given document.  The solr people are aware of this design,
>> which addresses the issues raised by SOLR-1834 very nicely.  However,
>> as I said before, time is a problem, and the work still needs to be done.
>>
>> I suggest you read up on the actual security model of LCF, and
>> perhaps experiment with that and the SOLR-1834 contribution, to see
>> if there is common ground.  One thing we've learned at MetaCarta is
>> that post-filtering for security purposes is expensive, and it is
>> better to modify the queries themselves to restrict the results, if
>> possible.  I'm not sure which approach SOLR-1834 takes, although it
>> sounds like it might be the filtering approach.  Still, it would be better than nothing.
>>
>> Please let me know what you find out.
>>
>> Thanks,
>> Karl
>>
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
>> dominique.bejean@eolya.fr>]
>> Sent: Tuesday, April 20, 2010 8:03 AM
>> To: Wright Karl (Nokia-S/Cambridge)
>> Cc: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>;
>> connectors-dev@incubator.apache.org<mailto:
>> connectors-dev@incubator.apache.org>
>> Subject: Re: Solr and LCF security at query time
>>
>> Karl,
>>
>> Thank you for your reply.
>>
>> I did some research today and found this:
>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> http://demo.findwise.se:8880/SolrSecurity/
>>
>> The Solr security model has to be able to filter a result list with
>> items coming from various sources at the same time (Livelink,
>> Documentum, file system, ...). Big subject :)
>>
>> Dominique
>>
>>
>> On 20/04/10 13:34,
>> karl.wright@nokia.com<ma...@nokia.com> wrote:
>> Hi Dominique,
>>
>> At the moment, in order to enforce the LCF security model within
>> Lucene/Solr, you will need to build this functionality into whatever
>> client you are using to display the Lucene search results.
>> Specifically, you would need to take the following steps:
>>
>> (1) Have your users access your search client through Apache.
>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> mod_authz_annotate, to cause authorization HTTP headers to be
>> transmitted to the client webapp.
>> (3) Have your client webapp alter whatever queries it is doing, to
>> add an appropriate query clause for each of the access tokens
>> transmitted in the headers.
>>
>> (This is how it is done at MetaCarta.)
>>
>> Alternatively, you may find a way to do this completely with a web
>> application under a Java app server such as Tomcat.  I have not yet
>> done the research to find out whether this is a feasible alternative.
>> Effectively, what you need something like mod_auth_kerb to do is to
>> authenticate your user against Active Directory, or whatever the
>> authenticator ought to be.  JAAS may be helpful here.
>>
>> There are, of course, intentions to fill out the missing pieces more
>> completely and transparently via a Solr search plugin and/or filter.
>> What has been lacking is time.  If you are in a position to do
>> development in this area, we're happy to have any assistance you might provide.
>>
>> Thanks,
>> Karl
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> Sent: Tuesday, April 20, 2010 5:06 AM
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>
>>  Subject: Solr and LCF security at query time
>>
>> Hi,
>>
>> I don't see in the LCF wiki how Solr and LCF work together at query
>> time in order to remove from the result list the items the user is
>> not allowed to access.
>>
>> In
>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> I just see these sentences:
>>
>> " Once all these documents and their access tokens are handed to the
>> search engine, it is the search engine's job to enforce security by
>> excluding inappropriate documents from the search results. For
>> Lucene, this infrastructure is expected to be built on top of
>> Lucene's generic metadata abilities, but has not been implemented at this time."
>>
>> I am not sure I understand. Does this mean that, for the moment, it
>> is not possible for Solr to apply security by using an Authority
>> Connector?
>>
>> Dominique
>>
>>
>>
>>
>>
>>
>>
>
>



RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

>>>>>>
For general Solr access control, there's two layers of security that need to be addressed:
  1. Authentication - make sure the incoming query is from a valid user, and the passed-in credentials (hash, certificate etc.) are correct
  2. Query filtering - potentially reduce the number/type of returned results based on the allow/deny metadata for the authenticated user

I can see how the LCF auth connector works for 2., but can it do 1. as well?
<<<<<<

The authority connectors don't perform authentication at this time.  In fact, LCF has nothing to do with authentication at all - just authorization.  The reason is that it is almost never the case that somebody wants to provide multiple credentials in order to be able to see their results.  Most enterprises that have multiple repositories authenticate against AD and then map AD user names to repository user names in order to access those repositories.  As you may have seen in my earlier posts from this morning, I'm looking at recommending JAAS plus Sun's Krb5 login module for handling the "authenticate against AD" case, which would cover some 95%+ of the real-world authentication needed out there.


>>>>>>
I think this is the bit that is worrying me - is this storing the SIDs into Solr at document index time? This would be a problem for a whole load of reasons, but maybe I'm missing something here? (see below for a possible alternative)
<<<<<<

Yes, the idea is to store SIDs in Solr at index time.  I don't know enough about Solr to know what kinds of issues this might entail, but Lucene certainly has a model of metadata that's pretty flexible, so I don't think this would be difficult at all.  Erik Hatcher also seemed to confirm my suspicions that this would not be a problem.
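
As a rough illustration of what storing the tokens at index time could look like, here is a minimal Python sketch. The helper name with_acl_fields is hypothetical, and the field names are an assumption: the worked example quoted later in this thread uses __ALLOW_TOKEN__document, while the output connector commit note mentions __ACCESS_TOKEN__document for the allow side.

```python
# Hypothetical field names; the thread uses both __ALLOW_TOKEN__document and
# __ACCESS_TOKEN__document for the allow side, so treat these as assumptions.
ALLOW_FIELD = "__ALLOW_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def with_acl_fields(doc, allow_tokens, deny_tokens=()):
    """Return a copy of doc with multi-valued allow/deny token fields added."""
    secured = dict(doc)  # leave the caller's document untouched
    secured[ALLOW_FIELD] = sorted(set(allow_tokens))
    secured[DENY_FIELD] = sorted(set(deny_tokens))
    return secured


# A file from DirectoryA in the example: readable by Group1 and Group2.
doc_a = with_acl_fields(
    {"id": "DirectoryA/report.txt"},
    ["myAD:S-123-456-76890", "myAD:S-23-64-12345"],
)
```

Both fields are multi-valued, since a document can carry several allow and several deny tokens.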

>>>>>>
The problem that arises from 'pluggable' authentication is that, if you're not using a certificate, you have to start with a password, but the connector only has access to the password hash (unless the pwd is sent in the query url). I don't know of a way to confirm identities in AD using only the username and hash (AD does the hash compare). I believe this is where container-based integration will likely work better.
<<<<<<

This is exactly why I think that we need to do the authentication upstream of the authority world.

>>>>>>
For another environment, let's say, NTFS, there might be an 'NTFS' connector that would provide some kind of mapping of files/folders to SID(s). Since Solr wouldn't intrinsically know about this, the acl information would need to be stored somewhere in the index. This would mean extending the Solr schema and storing metadata at index time.
<<<<<<

If Solr handles arbitrary document metadata, then I think we could just use that feature.  But you know more about it than me, at this point.  It would be great to get an overview of potential ways of doing this.
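
One potential way, sketched here under assumptions: if the tokens are stored as multi-valued fields on each document, the search side could turn a user's tokens into Solr filter queries (the fq parameter). The function names and the field names below are illustrative and are not part of LCF or Solr.

```python
def quote_term(token):
    """Wrap a token in double quotes so characters like ':' are taken literally."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_filter_queries(user_tokens,
                         allow_field="__ALLOW_TOKEN__document",  # assumed name
                         deny_field="__DENY_TOKEN__document"):   # assumed name
    """Build two fq strings: require an allow match, exclude any deny match."""
    tokens = sorted(set(user_tokens))
    if not tokens:
        # Without tokens nothing can be allowed; refuse rather than build
        # a filter that might accidentally let everything through.
        raise ValueError("user has no access tokens")
    terms = " OR ".join(quote_term(t) for t in tokens)
    return [
        f"{allow_field}:({terms})",
        f"-{deny_field}:({terms})",
    ]


fqs = build_filter_queries(["myAD:S-1-1-0", "myAD:S-23-64-12345"])
```

Each returned string would be sent as its own fq parameter, so the restriction is applied by modifying the query rather than by post-filtering the results.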

>>>>>>
If the above interpretation is [roughly] correct (please let me know if I've got this wrong!), this would reduce down to having:
   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.) (possibly/partly at the container level)
   2. At least an LCF Repository connector for 'acl.xml'
   3. Optional other LCF Repository connectors
<<<<<<

For your particular task, it sounds like you are trying to read from NTFS and apply security after-the-fact with some acl specification file.  In that case, I'd write a repository connector that was based on the file system connector (already part of the stable of connectors for LCF) which reads ACL information from your acl.xml file.  Or, if you prefer a UI for specifying ACL information, you could extend the connector so that security is configured in the UI without having an external acl.xml file at all - which would be a nice addition to the existing file system connector.  (Repository connections and jobs are configured internally in LCF by XML documents stored in the database, so they can be arbitrarily structured.  I'm happy to help you figure out how to do this if this is what you decide to do.)

I think we still need to add in the authentication piece to make this all work for you, so perhaps you can describe how you expect a user to interact with your system, so I can understand your design issues.

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 11:32 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks very much for your detailed explanation - really good!

As I've thought through some of the implications, I've added comments below, so I hope they don't seem too jumbled...

I suppose on the 'authority' side, it works kind of as I envisioned it would.

For general Solr access control, there's two layers of security that need to be addressed:
  1. Authentication - make sure the incoming query is from a valid user, and the passed-in credentials (hash, certificate etc.) are correct
  2. Query filtering - potentially reduce the number/type of returned results based on the allow/deny metadata for the authenticated user

I can see how the LCF auth connector works for 2., but can it do 1. as well?
It would be good if this could somehow be integrated into any container (Tomcat/Jetty et al) authentication that might be configured (probably related to your previous post). In many ways, it could/should be that the Authority (AD) part of the connector should only be concerned with 1. and not 2. (see below).

So, on the repository side, there is also an LCF connector that 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
I understand that LCF doesn't do the mapping - it delegates this task to the caller, but provides both sides of the equation (authority & repository).

>>>>>
- Each file in DirectoryA will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
- Each file in DirectoryB will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
<<<<<
I think this is the bit that is worrying me - is this storing the SIDs into Solr at document index time? This would be a problem for a whole load of reasons, but maybe I'm missing something here? (see below for a possible alternative)
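
For reference, the matching rule described in the quoted example further down can be sketched in a few lines of Python. This is one reading of that rule, not LCF code: a document is visible when at least one of the user's tokens appears among its ALLOW tokens and none appears among its DENY tokens. The function name is_visible is hypothetical; the SIDs are the ones from the quoted example.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """True if the user matches an ALLOW token and no DENY token."""
    user = set(user_tokens)
    allowed = bool(user & set(allow_tokens))
    denied = bool(user & set(deny_tokens))
    return allowed and not denied


# Tokens the authority service returns for "Peter" in the example.
peter = {"myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"}

# ALLOW tokens attached to the files of each directory.
directory_a = {"myAD:S-123-456-76890", "myAD:S-23-64-12345"}
directory_b = {"myAD:S-123-456-76890"}

print(is_visible(peter, directory_a))  # Group2 is allowed on DirectoryA
print(is_visible(peter, directory_b))  # only Group1 is allowed on DirectoryB
```

The tokens are compared as opaque strings, so the check works the same way for any authority, not just AD.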

Basically, what I'm getting at here is that the allow/deny values need to be stored in one of three places:
  1. In the authority (e.g. inside AD)
  2. In the document metadata (index-time)
  3. In external storage (e.g. acl.xml, NTFS etc.)

1. Extending AD is pretty much out, as this causes too many interop problems.
2. 'Hard-coding' acl information in the index makes it non-portable, resistant to changes, etc.
3. acl.xml is coupled with a Solr instance, but is easily ported/replicated.
Storing/retrieving acl information from the source (e.g. NTFS) is problematic, as the source may not be accessible (it may not even exist).

I believe 3. or a variant is the way to go on the repo side, which means the LCF Authority connector is mainly for Authentication (see above), which is what you want from AD et al integration.
The problem that arises from 'pluggable' authentication is that, if you're not using a certificate, you have to start with a password, but the connector only has access to the password hash (unless the pwd is sent in the query url). I don't know of a way to confirm identities in AD using only the username and hash (AD does the hash compare). I believe this is where container-based integration will likely work better.

So that I can confirm my understanding...a scenario might be like this:

We have an AD connector that fetches the SIDs and we can read them etc.
For my environment, where there are no 'files' (there's only a transient network stream), we have an LCF 'Solr Field Filter Query' connector that decides which Filter Queries to apply (allow and deny) for the passed in SID(s).

For another environment, let's say, NTFS, there might be an 'NTFS' connector that would provide some kind of mapping of files/folders to SID(s). Since Solr wouldn't intrinsically know about this, the acl information would need to be stored somewhere in the index. This would mean extending the Solr schema and storing metadata at index time.
The alternative is to re-use the 'Solr Field Filter Query' connector for this as well (and any other document types that might be read in). This keeps the index 'clean' of acl-specific metadata, and allows for in-place changes and easy cross-document/index/instance access control.


If the above interpretation is [roughly] correct (please let me know if I've got this wrong!), this would reduce down to having:
   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.) (possibly/partly at the container level)
   2. At least an LCF Repository connector for 'acl.xml'
   3. Optional other LCF Repository connectors

It sounds like you've now finished the first half of 1. by adding the ability to get the required auth data from a Solr api call. The other half of 1. will be implementing the LCF interface in the SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.

Does the above sound like an accurate interpretation? Just trying to get a good picture of what work needs doing, where it goes, etc.

Many thanks!
Peter




On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:

>  >>>>>>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an
> access_token value might contain?) <<<<<<
>
> Documents have access/deny attributes; authorities simply provide the
> list of tokens that belong to an authenticated user.  Thus, there's no
> access/deny for an authority; that's attached to the document (as it
> is in real-world repositories).
>
> Let's run a quick example, using Active Directory and a Windows file
> system.  Suppose that you have a directory with documents in it, call
> it DirectoryA, and the directory allows read access to the following SIDs:
>
> S-123-456-76890
> S-23-64-12345
>
> These SIDs correspond to active directory groups, let's call them
> Group1 and Group2, respectively.
>
> DirectoryB also has documents in it, and those documents have just the
> SID S-123-456-76890 attached, because only Group1 can read its contents.
>
> Now, pretend that someone has created an LCF Active Directory
> authority connection (in the LCF UI), which is called "myAD", and this
> connection is set up to talk to the governing AD domain controller for
> this Windows file system.  We now know enough to describe the document indexing process:
>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
>
> Now, suppose that a user (let's call him "Peter") is authenticated
> with the AD domain controller.  Peter belongs to Group2, so his SIDs are (say):
>
> S-1-1-0 (the 'everyone' SID)
> S-323-999-12345 (his own personal user SID)
> S-23-64-12345 (the SID he gets because he belongs to group 2)
>
> We want to look up the documents in the search index that he can see.
> So, we ask the LCF authority service what his tokens are, and we get back:
>
> "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
>
> The documents we should return in his search are the ones matching his
> search criteria, PLUS the intersection of his tokens with the document
> ALLOW tokens, MINUS the intersection of his tokens with the document
> DENY tokens (there aren't any involved in this example).  So only
> files that have one of his three tokens as an ALLOW attribute would be returned.
>
> Note that what we are attempting to do is enforce AD's security with
> the search results we present.  There is no need to define a whole new
> security mechanism, because AD already has one that people use.
>
> >>>>>>
> One of the key requirements I've worked to adhere to in SOLR-1872 is
> to ensure there are no security or other dependencies of indexed data
> with any external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is
> that Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes
> from files.
> <<<<<<
>
> LCF is all about abstracting from repositories.  It's not specifically
> about a file system, although that is a convenient example.  If you
> are building your own kind of repository with your own security setup,
> that's fine - but in the LCF world you'd need to create an authority
> connector for your repository (which maybe reads your acl.xml file),
> as well as a repository connector (which hands documents to LCF and
> provides it with the access tokens that make security work).  Of
> course, you can do something much lighter that doesn't include LCF at all
> if you are just integrating a custom repository of your own, but it
> sounded like you were interested in the broader problem here.
>
> So, LCF doesn't do "acl mapping" at all.  It relies on its various
> connectors to work cooperatively to define access tokens in a way that
> is consistent from authority connector to repository connector for a
> given repository kind.  Anybody can write a connector, so the beauty
> of all this is that you can build a system where data from many
> disparate sources is indexed, and security for each is simultaneously enforced.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Thursday, April 22, 2010 9:24 AM
>
> *To:* dev@lucene.apache.org
> *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for the diagram -
> Sorry about all the questions, but this raises a few new ones...
>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an
> access_token value might contain?)
>
> One of the key requirements I've worked to adhere to in SOLR-1872 is
> to ensure there are no security or other dependencies of indexed data
> with any external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is
> that Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes
> from files.
>
> This is one reason why SOLR-1872 uses filter queries for its
> access/deny tokens - so that all the required information for access
> control completely resides within the Solr index itself.
> Is the LCF architecture acl 'mapping' between Solr fields (queries)
> and users, some external 'repository' (files) and users, or arbitrary data (e.g.
> either of these)?
>
> I hope that makes sense...
>
> Thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
>
>> Hi Peter,
>>
>> I've attached a diagram that is not in the wiki as of yet, and I'll
>> try to answer your questions.
>>
>> >>>>>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored
>> for a particular user in the underlying acl store (e.g. Active Directory)?
>> How does AD and/or LCF handle storing such data in its schema? (does
>> AD need its schema extended?) Presumably, any such AD fields would
>> need to be queried for effective rights in order to cater for group
>> membership allows and denies.
>> <<<<<<
>>
>> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
>> strings that represent a contract between an LCF authority connection
>> and the LCF repository connection that picks up the documents (from wherever).
>>  These tokens thus have no real meaning outside of LCF.  You must
>> regard them as opaque.
>>
>> The contract, however, states that if you use the LCF authority
>> service to obtain tokens for an authenticated user, you will get back
>> a set that is CONSISTENT with the tokens that were attached to the
>> documents LCF sent to Solr for indexing in the first place.  So, you
>> don't have to worry about it, and that's kind of the idea.  So you imagine the following flow:
>>
>> (1) Use LCF to fetch documents and send them to Solr
>> (2) When searching, use the LCF authority service to get the desired
>> user's access tokens
>> (3) Either filter the results, or modify the query, to be sure the
>> access tokens all match up properly
>>
>> For the AD authority, the LCF access tokens consist, in part, of the
>> user's SIDs.  For other authorities, the access tokens are wildly different.
>>  You really don't want to know what's in them, since that's the job
>> of the LCF authority to determine. ;-)
>>
>> LCF is not, by the way, joined at the hip with AD.  However, in
>> practice, most enterprises in the world use some form of AD single
>> signon for their web applications, and even if they're using some
>> repository with its own idea of security, there's a mapping between
>> the AD users and the repository's users.  Doing that mapping is also
>> the job of the LCF authority for that repository.
>>
>> Hope this helps.  Also, I'm not expecting time miracles here, so
>> don't sweat the schedule.
>>
>>
>> Karl
>>
>>
>> ________________________________________
>> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> Sent: Thursday, April 22, 2010 4:27 AM
>> To: dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> connectors-dev@incubator.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the quick turnaround.
>> I'm in the middle of a product release for us, so I fear I won't be
>> as quick as you... :-)
>>
>> I couldn't find a simple flow diagram or similar for LCF with regard to
>> security (probably looking in the wrong place).
>> Perhaps you could help on these questions...?
>>
>> In SOLR-1872, the allows and denies are stored (in acl.xml) as
>> sub-queries, which are then used as filter queries in a user's search.
>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored
>> for a particular user in the underlying acl store (e.g. Active Directory)?
>> How does AD and/or LCF handle storing such data in its schema? (does
>> AD need its schema extended?) Presumably, any such AD fields would
>> need to be queried for effective rights in order to cater for group
>> membership allows and denies.
>>
>> I guess I'm just trying to understand the architectural
>> flow/storage/retrieval of data in the various parts of the system,
>> but I admit, I need to do more research on this.
>> After our product release, when I get a few more spare cycles, I can
>> look at it in more detail.
>>
>> Many thanks!
>> Peter
>>
>>
>>
>> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
>> Hi Peter,
>>
>> I just committed the promised changes to the LCF Solr output connector.
>>
>> ACL metadata will now be posted to the Solr Http interface along with
>> the document as the two following fields:
>>
>> __ACCESS_TOKEN__document
>> __DENY_TOKEN__document
>>
>> There will, of course, potentially be multiple values for each of
>> these two fields.
>>
>> Hope this helps,
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> Sent: Tuesday, April 20, 2010 6:51 PM
>>
>> To: connectors-user@incubator.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the info. I'll have a look at the link and try to take in
>> as much sugar as my insulin levels will handle...
>> It sounds like the necessary interface(s) are already in LCF - just a
>> matter of implementing them in the Solr 1872 plugin.
>> I'll need to digest the LCF stuff to get to grips with it..please
>> bear with me while I do that...
>>
>> When you say:
>>   The LCF solr output connection doesn't yet do this, but it is
>> trivial for me to make that happen.
>> Do you mean a mechanism by which solr.war can get url et al info from
>> its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>>
>>
>> Thanks,
>> Peter
>>
>>
>>
>>
>> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
>> Hi Peter,
>>
>> I'm the principal committer for LCF, but I don't know as much about
>> Solr as I ought to, so it sounds like a potentially productive collaboration.
>>
>> LCF does exactly what you are looking for - the only issue at all is
>> that you need to fetch a URL from a webapp to get what you are
>> looking for.  The "plugs" are all inside LCF for different kinds of
>> repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>>
>> The url would be something like this (on a locally installed
>> tomcat-based LCF instance):
>>
>>
>> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>>
>> ... and this fetch returns something like:
>>
>> TOKEN:xxxxxxx
>> TOKEN:yyyyyyy
>> TOKEN:zzzzzzz
>> ....
>>
>> ... which represent the amalgamated tokens for all of the defined
>> authorities, and by some strange coincidence ( ;-) ) are compatible
>> with certain pieces of metadata that have been passed into Solr with
>> each document - one set of Allow tokens, and a second set of Deny
>> tokens.  The LCF solr output connection doesn't yet do this, but it
>> is trivial for me to make that happen.
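
A minimal Python sketch of the client side of this exchange, assuming the response body is exactly the line-oriented TOKEN:value format shown above. The function names authority_url and parse_tokens are hypothetical; the base URL and the UserACLs path come from the example URL above.

```python
from urllib.parse import urlencode


def authority_url(base, username):
    """Build the UserACLs request URL, percent-encoding the user name."""
    return f"{base}/UserACLs?{urlencode({'username': username})}"


def parse_tokens(body):
    """Extract token values from lines of the form 'TOKEN:value'."""
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # Strip only the leading prefix; a token may itself contain ':'.
            tokens.append(line[len(prefix):])
    return tokens


url = authority_url("http://localhost:8080/lcf-authority-service",
                    "someusername@somedomain.com")
tokens = parse_tokens("TOKEN:myAD:S-1-1-0\nTOKEN:myAD:S-23-64-12345\n")
```

Lines that do not start with TOKEN: are ignored in this sketch; a real client would need to decide how to treat any other line types the service returns.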
>>
>> Does this sound plausible to you?
>>
>> Karl
>>
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> Sent: Tuesday, April 20, 2010 5:41 PM
>> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
>>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Integrating LCF to get external token support for SOLR-1872 sounds
>> very interesting indeed. I don't know anything about LCF, but one of
>> the things I was planning for SOLR-1872 is to make acl.xml (or rather
>> its behaviour) 'pluggable' - i.e. it would just be one of a series of
>> plugins that could be used for obtaining back-end authentication information.
>>
>> If you're good with LCF, perhaps we could work together to build this in.
>> One of the first things would be defining an interface that would be
>> as easy as possible to plug LCF into. Have you any
>> suggestions/insight on this front?
>>
>> Many thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
>> SOLR-1872 looks exactly like what I was envisioning, from the search
>> query perspective, although instead of reading the acl.xml file you specify,
>> with LCF you would dynamically query the lcf-authority-service
>> servlet for the access tokens themselves.  That would get you support
>> for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems
>> likely that this component could be modified to work with LCF with minor effort.
>>
>> The missing component still seems to be AD authentication, which
>> needs a solution.
>>
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> Sent: Tuesday, April 20, 2010 10:44 AM
>> To: dev@lucene.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> If you want to do this completely within Solr, have a look at:
>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>>
>> Thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
>> FYI
>>
>> ________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 20, 2010 8:16 AM
>> To: 'dominique.bejean@eolya.fr'
>> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
>> Subject: RE: Solr and LCF security at query time
>>
>> Dominique,
>>
>> Yes, I am aware of this ticket and contribution.  Luckily LCF
>> establishes a powerful multi-repository security model, even though
>> it doesn't yet do the final step of enforcing that model at the
>> search end.  LCF allows you to define multiple authorities to operate
>> against disparate repositories, and use the appropriate authority to
>> secure any given document.  The solr people are aware of this design,
>> which addresses the issues raised by SOLR-1834 very nicely.  However,
>> as I said before, time is a problem, and the work still needs to be done.
>>
>> I suggest you read up on the actual security model of LCF, and
>> perhaps experiment with that and the SOLR-1834 contribution, to see
>> if there is common ground.  One thing we've learned at MetaCarta is
>> that post-filtering for security purposes is expensive, and it is
>> better to modify the queries themselves to restrict the results, if
>> possible.  I'm not sure which approach SOLR-1834 takes, although it
>> sounds like it might be the filtering approach.  Still, it would be better than nothing.
>>
>> Please let me know what you find out.
>>
>> Thanks,
>> Karl
>>
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> Sent: Tuesday, April 20, 2010 8:03 AM
>> To: Wright Karl (Nokia-S/Cambridge)
>> Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
>> Subject: Re: Solr and LCF security at query time
>>
>> Karl,
>>
>> Thank you for your reply.
>>
>> I did some research today and I found this:
>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> http://demo.findwise.se:8880/SolrSecurity/
>>
>> The Solr security model has to be able to filter a result list with items
>> coming from various sources at the same time (livelink, documentum,
>> file system, ...). Big subject :)
>>
>> Dominique
>>
>>
>> On 20/04/10 13:34, karl.wright@nokia.com wrote:
>> Hi Dominique,
>>
>> At the moment, in order to enforce the LCF security model within
>> Lucene/Solr, you will need to build this functionality into whatever
>> client you are using to display the Lucene search results.
>> Specifically, you would need to take the following steps:
>>
>> (1) Have your users access your search client through Apache.
>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> mod_authz_annotate, to cause authorization HTTP headers to be
>> transmitted to the client webapp.
>> (3) Have your client webapp alter whatever queries it is doing, to
>> add an appropriate query clause for each of the access tokens
>> transmitted in the headers.
>>
>> (This is how it is done at MetaCarta.)
>>
>> Alternatively, you may find a way to do this completely with a web
>> application under a Java app server such as Tomcat.  I have not yet
>> done the research to find out whether this is a feasible alternative.
>> Effectively, what you need something like mod_auth_kerb to do is to
>> authenticate your user against Active Directory, or whatever the authenticator ought to be.
>> JAAS may be helpful here.
>>
>> There are, of course, intentions to fill out the missing pieces more
>> completely and transparently via a Solr search plugin and/or filter.
>> What has been lacking is time.  If you are in a position to do
>> development in this area, we're happy to have any assistance you might provide.
>>
>> Thanks,
>> Karl
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> Sent: Tuesday, April 20, 2010 5:06 AM
>> To: connectors-user@incubator.apache.org
>> Subject: Solr and LCF security at query time
>>
>> Hi,
>>
>> I don't see in the LCF wiki how Solr and LCF work together at query time
>> in order to remove from the result list the items the user is not
>> allowed to access.
>>
>> In
>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> I just see these sentences:
>>
>> " Once all these documents and their access tokens are handed to the
>> search engine, it is the search engine's job to enforce security by
>> excluding inappropriate documents from the search results. For
>> Lucene, this infrastructure is expected to be built on top of
>> Lucene's generic metadata abilities, but has not been implemented at this time."
>>
>> I am not sure I understand. Does this mean that for the moment, it
>> is not possible for Solr to apply security by using an Authority Connector?
>>
>> Dominique
>>
>>
>>
>>
>>
>>
>>
>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

>>>>>>
For general Solr access control, there's two layers of security that need to be addressed:
  1. Authentication - make sure the incoming query is from a valid user, and the passed-in credentials (hash, certificate etc.) are correct
  2. Query filtering - potentially reduce the number/type of returned results based on the allow/deny metadata for the authenticated user

I can see how the LCF auth connector works for 2., but can it do 1. as well?
<<<<<<

The authority connectors don't perform authentication at this time.  In fact, LCF has nothing to do with authentication at all - just authorization.  The reason is that it is almost never the case that somebody wants to provide multiple credentials in order to be able to see their results.  Most enterprises that have multiple repositories authenticate against AD and then map AD user names to repository user names in order to access those repositories.  As you may have seen in my earlier posts from this morning, I'm looking at recommending JAAS plus Sun's Krb5 login module for handling the "authenticate against AD" case, which would cover some 95%+ of the real-world authentication needed out there.


>>>>>>
I think this is the bit that is worrying me - is this storing the SIDs into Solr at document index time? This would be a problem for a whole load of reasons, but maybe I'm missing something here? (see below for a possible alternative)
<<<<<<

Yes, the idea is to store SIDs in Solr at index time.  I don't know enough about Solr to know what kinds of issues this might entail, but Lucene certainly has a model of metadata that's pretty flexible, so I don't think this would be difficult at all.  Erik Hatcher also seemed to confirm my suspicions that this would not be a problem.

>>>>>>
The problem that arises from 'pluggable' authentication is that, if you're not using a certificate, you have to start with a password, but the connector only has access to the password hash (unless the pwd is sent in the query url). I don't know of a way to confirm identities in AD using only the username and hash (AD does the hash compare). I believe this is where container-based integration will likely work better.
<<<<<<

This is exactly why I think that we need to do the authentication upstream of the authority world.

>>>>>>
For another environment, let's say, NTFS, there might be an 'NTFS' connector that would provide some kind of mapping of files/folders to SID(s). Since Solr wouldn't intrinsically know about this, the acl information would need to be stored somewhere in the index. This would mean extending the Solr schema and storing metadata at index time.
<<<<<<

If Solr handles arbitrary document metadata, then I think we could just use that feature.  But you know more about it than me, at this point.  It would be great to get an overview of potential ways of doing this.

>>>>>>
If the above interpretation is [roughly] correct (please let me know if I've got this wrong!), this would reduce down to having:
   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.) (possibly/partly at the container level)
   2. At least an LCF Repository connector for 'acl.xml'
   3. Optional other LCF Repository connectors
<<<<<<

For your particular task, it sounds like you are trying to read from NTFS and apply security after-the-fact with some acl specification file.  In that case, I'd write a repository connector that was based on the file system connector (already part of the stable of connectors for LCF) which reads ACL information from your acl.xml file.  Or, if you prefer a UI for specifying ACL information, you could extend the connector so that security is configured in the UI without having an external acl.xml file at all - which would be a nice addition to the existing file system connector.  (Repository connections and jobs are configured internally in LCF by XML documents stored in the database, so they can be arbitrarily structured.  I'm happy to help you figure out how to do this if this is what you decide to do.)

I think we still need to add in the authentication piece to make this all work for you, so perhaps you can describe how you expect a user to interact with your system, so I can understand your design issues.

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 11:32 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks very much for your detailed explanation - really good!

As I've thought through some of the implications, I've added comments below, so I hope they don't seem too jumbled...

I suppose on the 'authority' side, it works kind of as I envisioned it would.

For general Solr access control, there's two layers of security that need to be addressed:
  1. Authentication - make sure the incoming query is from a valid user, and the passed-in credentials (hash, certificate etc.) are correct
  2. Query filtering - potentially reduce the number/type of returned results based on the allow/deny metadata for the authenticated user

I can see how the LCF auth connector works for 2., but can it do 1. as well?
It would be good if this could somehow be integrated into any container (Tomcat/Jetty et al) authentication that might be configured (probably related to your previous post). In many ways, it could/should be that the Authority (AD) part of the connector should only be concerned with 1. and not 2. (see below).

So, on the repository side, there is also an LCF connector that 'closes the loop' to provide the 'what is it I'm trying to control' side of things.
I understand that LCF doesn't do the mapping - it delegates this task to the caller, but provides both sides of the equation (authority & repository).

>>>>>
- Each file in DirectoryA will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
- Each file in DirectoryB will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
<<<<<
I think this is the bit that is worrying me - is this storing the SIDs into Solr at document index time? This would be a problem for a whole load of reasons, but maybe I'm missing something here? (see below for a possible
alternative)

Basically, what I'm getting at here is that the allow/deny values need to be stored in one of three places:
  1. In the authority (e.g. inside AD)
  2. In the document metadata (index-time)
  3. In external storage (e.g. acl.xml, NTFS etc.)

1. Extending AD is pretty much out, as this causes too many interop problems.
2. 'Hard-coding' acl information in the index makes it non-portable, resistant to changes, etc.
3. acl.xml is coupled with a Solr instance, but is easily ported/replicated.
Storing/retrieving acl information from the source (e.g. NTFS) is problematic, as the source may not be accessible (it may not even exist).

I believe 3. or a variant is the way to go on the repo side, which means the LCF Authority connector is mainly for Authentication (see above), which is what you want from AD et al integration.
The problem that arises from 'pluggable' authentication is that, if you're not using a certificate, you have to start with a password, but the connector only has access to the password hash (unless the pwd is sent in the query url). I don't know of a way to confirm identities in AD using only the username and hash (AD does the hash compare). I believe this is where container-based integration will likely work better.

So that I can confirm my understanding...a scenario might be like this:

We have an AD connector that fetches the SIDs and we can read them etc.
For my environment, where there are no 'files' (there's only a transient network stream), we have an LCF 'Solr Field Filter Query' connector that decides which Filter Queries to apply (allow and deny) for the passed in SID(s).

For another environment, let's say, NTFS, there might be an 'NTFS' connector that would provide some kind of mapping of files/folders to SID(s). Since Solr wouldn't intrinsically know about this, the acl information would need to be stored somewhere in the index. This would mean extending the Solr schema and storing metadata at index time.
The alternative is to re-use the 'Solr Field Filter Query' connector for this as well (and any other document types that might be read in). This keeps the index 'clean' of acl-specific metadata, and allows for in-place changes and easy cross-document/index/instance access control.


If the above interpretation is [roughly] correct (please let me know if I've got this wrong!), this would reduce down to having:
   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.) (possibly/partly at the container level)
   2. At least an LCF Repository connector for 'acl.xml'
   3. Optional other LCF Repository connectors

It sounds like you've now finished the first half of 1. by adding the ability to get the required auth data from a Solr api call. The other half of 1. will be implementing the LCF interface in the SolrACLSecurity class, to effectively replace the 'user', 'group' and 'password' bits of acl.xml.

Does the above sound like an accurate interpretation? Just trying to get a good picture of what work needs doing, where it goes, etc.

Many thanks!
Peter




On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:

>  >>>>>>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an
> access_token value might contain?) <<<<<<
>
> Documents have access/deny attributes; authorities simply provide the
> list of tokens that belong to an authenticated user.  Thus, there's no
> access/deny for an authority; that's attached to the document (as it
> is in real-world repositories).
>
> Let's run a quick example, using Active Directory and a Windows file
> system.  Suppose that you have a directory with documents in it, call
> it DirectoryA, and the directory allows read access to the following SIDs:
>
> S-123-456-76890
> S-23-64-12345
>
> These SIDs correspond to active directory groups, let's call them
> Group1 and Group2, respectively.
>
> DirectoryB also has documents in it, and those documents have just the
> SID S-123-456-76890 attached, because only Group1 can read its contents.
>
> Now, pretend that someone has created an LCF Active Directory
> authority connection (in the LCF UI), which is called "myAD", and this
> connection is set up to talk to the governing AD domain controller for
> this Windows file system.  We now know enough to describe the document indexing process:
>
> - Each file in DirectoryA will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following
> __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"
>
> Now, suppose that a user (let's call him "Peter") is authenticated
> with the AD domain controller.  Peter belongs to Group2, so his SIDs are (say):
>
> S-1-1-0 (the 'everyone' SID)
> S-323-999-12345 (his own personal user SID)
> S-23-64-12345 (the SID he gets because he belongs to group 2)
>
> We want to look up the documents in the search index that he can see.
> So, we ask the LCF authority service what his tokens are, and we get back:
>
> "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
>
> The documents we should return in his search are the ones matching his
> search criteria, PLUS the intersection of his tokens with the document
> ALLOW tokens, MINUS the intersection of his tokens with the document
> DENY tokens (there aren't any involved in this example).  So only
> files that have one of his three tokens as an ALLOW attribute would be returned.
>
> Note that what we are attempting to do is enforce AD's security with
> the search results we present.  There is no need to define a whole new
> security mechanism, because AD already has one that people use.
>
> >>>>>>
> One of the key requirements I've worked to adhere to in SOLR-1872 is
> to ensure there are no security or other dependencies of indexed data
> with any external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is
> that Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes
> from files.
> <<<<<<
>
> LCF is all about abstracting from repositories.  It's not specifically
> about a file system, although that is a convenient example.  If you
> are building your own kind of repository with your own security setup,
> that's fine - but in the LCF world you'd need to create an authority
> connector for your repository (which maybe reads your acl.xml file),
> as well as a repository connector (which hands documents to LCF and
> provides it with the access tokens that make security work).  Of
> course, you can do something much lighter that doesn't include LCF at all
> if you are just integrating a custom repository of your own, but it
> sounded like you were interested in the broader problem here.
>
> So, LCF doesn't do "acl mapping" at all.  It relies on its various
> connectors to work cooperatively to define access tokens in a way that
> is consistent from authority connector to repository connector for a
> given repository kind.  Anybody can write a connector, so the beauty
> of all this is that you can build a system where data from many
> disparate sources is indexed, and security for each is simultaneously enforced.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Thursday, April 22, 2010 9:24 AM
>
> *To:* dev@lucene.apache.org
> *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for the diagram -
> Sorry about all the questions, but this raises a few new ones...
>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an
> access_token value might contain?)
>
> One of the key requirements I've worked to adhere to in SOLR-1872 is
> to ensure there are no security or other dependencies of indexed data
> with any external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is
> that Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes
> from files.
>
> This is one reason why SOLR-1872 uses filter queries for its
> access/deny tokens - so that all the required information for access
> control completely resides within the Solr index itself.
> Is the LCF architecture acl 'mapping' between Solr fields (queries)
> and users, some external 'repository' (files) and users, or arbitrary data (e.g.
> either of these)?
>
> I hope that makes sense...
>
> Thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
>
>> Hi Peter,
>>
>> I've attached a diagram that is not in the wiki as of yet, and I'll
>> try to answer your questions.
>>
>> >>>>>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored
>> for a particular user in the underlying acl store (e.g. Active Directory)?
>> How does AD and/or LCF handle storing such data in its schema? (does
>> AD needs its schema extended?) Presumably, any such AD fields would
>> need to be queried for effective rights in order to cater for group
>> membership allows and denies.
>> <<<<<<
>>
>> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
>> strings that represent a contract between an LCF authority connection
>> and the LCF repository connection that picks up the documents (from wherever).
>>  These tokens thus have no real meaning outside of LCF.  You must
>> regard them as opaque.
>>
>> The contract, however, states that if you use the LCF authority
>> service to obtain tokens for an authenticated user, you will get back
>> a set that is CONSISTENT with the tokens that were attached to the
>> documents LCF sent to Solr for indexing in the first place.  So, you
>> don't have to worry about it, and that's kind of the idea.  So you imagine the following flow:
>>
>> (1) Use LCF to fetch documents and send them to Solr
>> (2) When searching, use the LCF authority service to get the desired
>> user's access tokens
>> (3) Either filter the results, or modify the query, to be sure the
>> access tokens all match up properly
>>
>> For the AD authority, the LCF access tokens consist, in part, of the
>> user's SIDs.  For other authorities, the access tokens are wildly different.
>>  You really don't want to know what's in them, since that's the job
>> of the LCF authority to determine. ;-)
>>
>> LCF is not, by the way, joined at the hip with AD.  However, in
>> practice, most enterprises in the world use some form of AD single
>> signon for their web applications, and even if they're using some
>> repository with its own idea of security, there's a mapping between
>> the AD users and the repository's users.  Doing that mapping is also
>> the job of the LCF authority for that repository.
>>
>> Hope this helps.  Also, I'm not expecting time miracles here, so
>> don't sweat the schedule.
>>
>>
>> Karl
>>
>>
>> ________________________________________
>> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> Sent: Thursday, April 22, 2010 4:27 AM
>> To: dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> connectors-dev@incubator.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the quick turnaround.
>> I'm in the middle of a product release for us, so I fear I won't be
>> as quick as you... :-)
>>
>> I couldn't find a simple flow diagram or similar for LCF with regards
>> security (probably looking in the wrong place).
>> Perhaps you could help on these questions...?
>>
>> In SOLR-1872, the allows and denies are stored (in acl.xml) as
>> sub-queries, which are then used as filter queries in a user's search.
>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored
>> for a particular user in the underlying acl store (e.g. Active Directory)?
>> How does AD and/or LCF handle storing such data in its schema? (does
>> AD needs its schema extended?) Presumably, any such AD fields would
>> need to be queried for effective rights in order to cater for group
>> membership allows and denies.
>>
>> I guess I'm just trying to understand the architectural
>> flow/storage/retrieval of data in the various parts of the system,
>> but I admit, I need to do more research on this.
>> After our product release, when I get a few more spare cycles, I can
>> look at it in more detail.
>>
>> Many thanks!
>> Peter
>>
>>
>>
>> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I just committed the promised changes to the LCF Solr output connector.
>>
>> ACL metadata will now be posted to the Solr Http interface along with
>> the document as the two following fields:
>>
>> __ACCESS_TOKEN__document
>> __DENY_TOKEN__document
>>
>> There will, of course, potentially be multiple values for each of
>> these two fields.
>>
>> Hope this helps,
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 6:51 PM
>>
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the info. I'll have a look at the link and try to take in
>> as much sugar as my insulin levels will handle...
>> It sounds like the necessary interface(s) are already in LCF - just a
>> matter of implementing them in the Solr 1872 plugin.
>> I'll need to digest the LCF stuff to get to grips with it..please
>> bear with me while I do that...
>>
>> When you say:
>>   The LCF solr output connection doesn't yet do this, but it is
>> trivial for me to make that happen.
>> Do you mean a mechanism by which solr.war can get url et al info from
>> its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>>
>>
>> Thanks,
>> Peter
>>
>>
>>
>>
>> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I'm the principal committer for LCF, but I don't know as much about
>> Solr as I ought to, so it sounds like a potentially productive collaboration.
>>
>> LCF does exactly what you are looking for - the only issue at all is
>> that you need to fetch a URL from a webapp to get what you are
>> looking for.  The "plugs" are all inside LCF for different kinds of
>> repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>>
>> The url would be something like this (on a locally installed
>> tomcat-based LCF instance):
>>
>>
>> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>>
>> ... and this fetch returns something like:
>>
>> TOKEN:xxxxxxx
>> TOKEN:yyyyyyy
>> TOKEN:zzzzzzz
>> ....
>>
>> ... which represent the amalgamated tokens for all of the defined
>> authorities, and by some strange coincidence ( ;-) ) are compatible
>> with certain pieces of metadata that have been passed into Solr with
>> each document - one set of Allow tokens, and a second set of Deny
>> tokens.  The LCF solr output connection doesn't yet do this, but it
>> is trivial for me to make that happen.
>>
>> Does this sound plausible to you?
>>
>> Karl
>>
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 5:41 PM
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
>> dev@lucene.apache.org>
>>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Integrating LCF to get external token support for SOLR-1872 sounds
>> very interesting indeed. I don't know anything about LCF, but one of
>> the things I was planning for SOLR-1872 is to make acl.xml (or rather
>> its behaviour) 'pluggable' - i.e. it would just be one of a series of
>> plugins that could be used for obtaining back-end authentication information.
>>
>> If you're good with LCF, perhaps we could work together to build this in.
>> One of the first things would be defining an interface that would be
>> as easy as possible to plug LCF into. Have you any
>> suggestions/insight on this front?
>>
>> Many thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> SOLR-1872 looks exactly like what I was envisioning, from the search
>> query perspective, although instead of the acl xml file you specify
>> LCF stipulates you would dynamically query the lcf-authority-service
>> servlet for the access tokens themselves.  That would get you support
>> for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems
>> likely that this component could be modified to work with LCF with minor effort.
>>
>> The missing component still seems to be AD authentication, which
>> needs a solution.
>>
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 10:44 AM
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> If you want to do this completely within Solr, have a look at:
>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>>
>> Thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> FYI
>>
>> ________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 20, 2010 8:16 AM
>> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
>> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
>> connectors-dev@incubator.apache.org<mailto:
>> connectors-dev@incubator.apache.org>'; '
>> connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>'
>> Subject: RE: Solr and LCF security at query time
>>
>> Dominique,
>>
>> Yes, I am aware of this ticket and contribution.  Luckily LCF
>> establishes a powerful multi-repository security model, even though
>> it doesn't yet do the final step of enforcing that model at the
>> search end.  LCF allows you to define multiple authorities to operate
>> against disparate repositories, and use the appropriate authority to
>> secure any given document.  The solr people are aware of this design,
>> which addresses the issues raised by SOLR-1834 very nicely.  However,
>> as I said before, time is a problem, and the work still needs to be done.
>>
>> I suggest you read up on the actual security model of LCF, and
>> perhaps experiment with that and the SOLR-1834 contribution, to see
>> if there is common ground.  One thing we've learned at MetaCarta is
>> that post-filtering for security purposes is expensive, and it is
>> better to modify the queries themselves to restrict the results, if
>> possible.  I'm not sure which approach SOLR-1834 takes, although it
>> sounds like it might be the filtering approach.  Still, it would be better than nothing.
>>
>> Please let me know what you find out.
>>
>> Thanks,
>> Karl
>>
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
>> dominique.bejean@eolya.fr>]
>> Sent: Tuesday, April 20, 2010 8:03 AM
>> To: Wright Karl (Nokia-S/Cambridge)
>> Cc: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>;
>> connectors-dev@incubator.apache.org<mailto:
>> connectors-dev@incubator.apache.org>
>> Subject: Re: Solr and LCF security at query time
>>
>> Karl,
>>
>> Thank you for your reply.
>>
>> I did some research today and found this:
>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> http://demo.findwise.se:8880/SolrSecurity/
>>
>> The Solr security model has to be able to filter a result list whose items
>> come from various sources at the same time (Livelink, Documentum,
>> file system, ...). Big subject :)
>>
>> Dominique
>>
>>
>> On 20/04/10 13:34,
>> karl.wright@nokia.com<ma...@nokia.com> wrote:
>> Hi Dominique,
>>
>> At the moment, in order to enforce the LCF security model within
>> Lucene/Solr, you will need to build this functionality into whatever
>> client you are using to display the Lucene search results.
>> Specifically, you would need to take the following steps:
>>
>> (1) Have your users access your search client through Apache.
>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> mod_authz_annotate, to cause authorization HTTP headers to be
>> transmitted to the client webapp.
>> (3) Have your client webapp alter whatever queries it is doing, to
>> add an appropriate query clause for each of the access tokens
>> transmitted in the headers.
>>
>> (This is how it is done at MetaCarta.)
>>
>> Alternatively, you may find a way to do this completely with a web
>> application under a Java app server such as Tomcat.  I have not yet
>> done the research to find out whether this is a feasible alternative.
>> Effectively, what you need something like mod_auth_kerb to do is to
>> authenticate your user against Active Directory, or whatever the authenticator ought to be.
>>  JAAS may be helpful here.
>>
>> There are, of course, intentions to fill out the missing pieces more
>> completely and transparently via a Solr search plugin and/or filter.
>> What has been lacking is time.  If you are in a position to do
>> development in this area, we're happy to have any assistance you might provide.
>>
>> Thanks,
>> Karl
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> Sent: Tuesday, April 20, 2010 5:06 AM
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>
>>  Subject: Solr and LCF security at query time
>>
>> Hi,
>>
>> I don't see in the LCF wiki how Solr and LCF work together at query time
>> in order to remove from the result list the items the user is not
>> allowed to access.
>>
>> In
>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> I just see these sentences :
>>
>> " Once all these documents and their access tokens are handed to the
>> search engine, it is the search engine's job to enforce security by
>> excluding inappropriate documents from the search results. For
>> Lucene, this infrastructure is expected to be built on top of
>> Lucene's generic metadata abilities, but has not been implemented at this time."
>>
>> I am not sure I understand. Does this mean that, for the moment, it
>> is not possible for Solr to apply security by using an Authority Connector?
>>
>> Dominique
>>
>>
>>
>>
>>
>>
>>
>
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Thanks very much for your detailed explanation - really good!

As I've thought through some of the implications, I've added comments below,
so I hope they don't seem too jumbled...

I suppose on the 'authority' side, it works kind of as I envisioned it
would.

For general Solr access control, there's two layers of security that need to
be addressed:
  1. Authentication - make sure the incoming query is from a valid user, and
the passed-in credentials (hash, certificate etc.) are correct
  2. Query filtering - potentially reduce the number/type of returned
results based on the allow/deny metadata for the authenticated user

I can see how the LCF auth connector works for 2., but can it do 1. as well?
It would be good if this could somehow be integrated into any container
(Tomcat/Jetty et al) authentication that might be configured (probably
related to your previous post). In many ways, it could/should be that the
Authority (AD) part of the connector should only be concerned with 1. and
not 2. (see below).
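
To make layer 2 concrete, here is a minimal Python sketch of turning a user's access tokens into a Solr filter query.  It assumes the authority service answers with lines of the form "TOKEN:<value>" (as quoted elsewhere in this thread), that the token fields are named __ALLOW_TOKEN__document and __DENY_TOKEN__document, and that both are indexed as non-tokenized string fields.  The function names are hypothetical, and authentication (layer 1) is assumed to have happened already.

```python
ALLOW_FIELD = "__ALLOW_TOKEN__document"  # assumed field name
DENY_FIELD = "__DENY_TOKEN__document"    # assumed field name


def parse_authority_response(body):
    """Extract tokens from lines of the form 'TOKEN:<value>'."""
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


def quote_term(value):
    """Quote a value as a phrase, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_security_filter(tokens):
    """Require at least one matching allow token and no matching deny token."""
    if not tokens:
        # Without tokens nothing can be allowed; leave that to the caller.
        raise ValueError("no access tokens: caller should return no results")
    terms = " OR ".join(quote_term(t) for t in tokens)
    return f"{ALLOW_FIELD}:({terms}) AND NOT {DENY_FIELD}:({terms})"


# Example response body; the values follow the "myAD:S-..." example.
response = (
    "TOKEN:myAD:S-1-1-0\n"
    "TOKEN:myAD:S-323-999-12345\n"
    "TOKEN:myAD:S-23-64-12345\n"
)
tokens = parse_authority_response(response)
security_fq = build_security_filter(tokens)
print(security_fq)
```

The resulting string would be passed as a filter query alongside the user's own query, so the restriction is applied by modifying the query rather than by post-filtering the results.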

So, on the repository side, there is also an LCF connector that 'closes the
loop' to provide the 'what is it I'm trying to control' side of things.
I understand that LCF doesn't do the mapping - it delegates this task to the
caller, but provides both sides of the equation (authority & repository).

>>>>>
- Each file in DirectoryA will have the following __ALLOW_TOKEN__document
attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
- Each file in DirectoryB will have the following __ALLOW_TOKEN__document
attributes inside Solr: "myAD:S-123-456-76890"
<<<<<
I think this is the bit that is worrying me - is this storing the SIDs into
Solr at document index time? This would be a problem for a whole load of
reasons, but maybe I'm missing something here? (see below for a possible
alternative)

Basically, what I'm getting at here is that the allow/deny values need to be
stored in one of three places:
  1. In the authority (e.g. inside AD)
  2. In the document metadata (index-time)
  3. In external storage (e.g. acl.xml, NTFS etc.)

1. Extending AD is pretty much out, as this causes too many interop problems
2. 'Hard-coding' acl information in the index makes it non-portable,
resistant to changes, etc.
3. acl.xml is coupled with a Solr instance, but is easily ported/replicated.
Storing/retrieving acl information from the source (e.g. NTFS) is
problematic, as the source may not be accessible (it may not even exist).

I believe 3. or a variant is the way to go on the repo side, which means the
LCF Authority connector is mainly for Authentication (see above), which is
what you want from AD et al integration.
The problem that arises from 'pluggable' authentication is that, if you're
not using a certificate, you have to start with a password, but the
connector only has access to the password hash (unless the pwd is sent in
the query url). I don't know of a way to confirm identities in AD using only
the username and hash (AD does the hash compare). I believe this is where
container-based integration will likely work better.

So that I can confirm my understanding...a scenario might be like this:

We have an AD connector that fetches the SIDs and we can read them etc.
For my environment, where there are no 'files' (there's only a transient
network stream), we have an LCF 'Solr Field Filter Query' connector that
decides which Filter Queries to apply (allow and deny) for the passed in
SID(s).

For another environment, let's say, NTFS, there might be an 'NTFS' connector
that would provide some kind of mapping of files/folders to SID(s). Since
Solr wouldn't intrinsically know about this, the acl information would need
to be stored somewhere in the index. This would mean extending the Solr
schema and storing metadata at index time.
The alternative is to re-use the 'Solr Field Filter Query' connector for
this as well (and any other document types that might be read in). This
keeps the index 'clean' of acl-specific metadata, and allows for in-place
changes and easy cross-document/index/instance access control.
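
A minimal Python sketch of this alternative follows, assuming an external store that maps each token (SID) to allow and deny sub-queries.  The ACL dictionary, the "source" and "level" field names, and the function names are all hypothetical; this is not the acl.xml format of SOLR-1872, only the general idea of keeping the rules outside the index.

```python
# Hypothetical external ACL store: token -> allow/deny Solr sub-queries.
ACL = {
    "myAD:S-123-456-76890": {"allow": ["source:firewall"], "deny": []},
    "myAD:S-23-64-12345": {"allow": ["source:proxy"], "deny": ["level:secret"]},
}


def filter_queries_for(tokens, acl=ACL):
    """Collect the allow and deny sub-queries for the tokens a user holds."""
    allows, denies = [], []
    for token in tokens:
        entry = acl.get(token)
        if entry is None:
            continue  # this token has no rules in the store
        allows.extend(entry["allow"])
        denies.extend(entry["deny"])
    return allows, denies


def combine(allows, denies):
    """Build one filter: any allow may match, no deny may match."""
    if not allows:
        return None  # nothing is allowed; caller should return no results
    fq = "(" + " OR ".join(f"({q})" for q in allows) + ")"
    for q in denies:
        fq += f" AND NOT ({q})"
    return fq


allows, denies = filter_queries_for(["myAD:S-1-1-0", "myAD:S-23-64-12345"])
acl_fq = combine(allows, denies)
print(acl_fq)
```

With this layout, changing who can see what means editing the external store only; nothing in the index has to be rewritten.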


If the above interpretation is [roughly] correct (please let me know if I've
got this wrong!), this would reduce down to having:
   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
(possibly/partly at the container level)
   2. At least an LCF Repository connector for 'acl.xml'
   3. Optional other LCF Repository connectors

It sounds like you've now finished the first half of 1. by adding the
ability to get the required auth data from a Solr api call. The other half
of 1. will be implementing the LCF interface in the SolrACLSecurity class,
to effectively replace the 'user', 'group' and 'password' bits of acl.xml.

Does the above sound like an accurate interpretation? Just trying to get a
good picture of what work needs doing, where it goes, etc.

Many thanks!
Peter




On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:

>  >>>>>>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an access_token
> value might contain?)
> <<<<<<
>
> Documents have access/deny attributes; authorities simply provide the list
> of tokens that belong to an authenticated user.  Thus, there's no
> access/deny for an authority; that's attached to the document (as it is in
> real-world repositories).
>
> Let's run a quick example, using Active Directory and a Windows file
> system.  Suppose that you have a directory with documents in it, call it
> DirectoryA, and the directory allows read access to the following SIDs:
>
> S-123-456-76890
> S-23-64-12345
>
> These SIDs correspond to active directory groups, let's call them Group1
> and Group2, respectively.
>
> DirectoryB also has documents in it, and those documents have just the SID
> S-123-456-76890 attached, because only Group1 can read its contents.
>
> Now, pretend that someone has created an LCF Active Directory authority
> connection (in the LCF UI), which is called "myAD", and this connection is
> set up to talk to the governing AD domain controller for this Windows file
> system.  We now know enough to describe the document indexing process:
>
> - Each file in DirectoryA will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890"
>
> Now, suppose that a user (let's call him "Peter") is authenticated with the
> AD domain controller.  Peter belongs to Group2, so his SIDs are (say):
>
> S-1-1-0 (the 'everyone' SID)
> S-323-999-12345 (his own personal user SID)
> S-23-64-12345 (the SID he gets because he belongs to group 2)
>
> We want to look up the documents in the search index that he can see.  So,
> we ask the LCF authority service what his tokens are, and we get back:
>
> "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
>
> The documents we should return in his search are the ones matching his
> search criteria, PLUS the intersection of his tokens with the document ALLOW
> tokens, MINUS the intersection of his tokens with the document DENY tokens
> (there aren't any involved in this example).  So only files that have one
> of his three tokens as an ALLOW attribute would be returned.
>
> Note that what we are attempting to do is enforce AD's security with the
> search results we present.  There is no need to define a whole new security
> mechanism, because AD already has one that people use.
>
> >>>>>>
> One of the key requirements I've worked to adhere to in SOLR-1872 is to
> ensure there are no security or other dependencies of indexed data with any
> external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is that
> Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes from
> files.
> <<<<<<
>
> LCF is all about abstracting from repositories.  It's not specifically
> about a file system, although that is a convenient example.  If you are
> building your own kind of repository with your own security setup, that's
> fine - but in the LCF world you'd need to create an authority connector for
> your repository (which maybe reads your acl.xml file), as well as a
> repository connector (which hands documents to LCF and provides it with the
> access tokens that make security work).  Of course, you can do something much
> lighter that doesn't include LCF at all if you are just integrating a custom
> repository of your own, but it sounded like you were interested in the
> broader problem here.
>
> So, LCF doesn't do "acl mapping" at all.  It relies on its various
> connectors to work cooperatively to define access tokens in a way that is
> consistent from authority connector to repository connector for a given
> repository kind.  Anybody can write a connector, so the beauty of all this
> is that you can build a system where data from many disparate sources is
> indexed, and security for each is simultaneously enforced.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Thursday, April 22, 2010 9:24 AM
>
> *To:* dev@lucene.apache.org
> *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for the diagram -
> Sorry about all the questions, but this raises a few new ones...
>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an access_token
> value might contain?)
>
> One of the key requirements I've worked to adhere to in SOLR-1872 is to
> ensure there are no security or other dependencies of indexed data with any
> external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is that
> Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes from
> files.
>
> This is one reason why SOLR-1872 uses filter queries for its access/deny
> tokens - so that all the required information for access control completely
> resides within the Solr index itself.
> Is the LCF architecture acl 'mapping' between Solr fields (queries) and
> users, some external 'repository' (files) and users, or arbitrary data (e.g.
> either of these)?
>
> I hope that makes sense...
>
> Thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
>
>> Hi Peter,
>>
>> I've attached a diagram that is not in the wiki as of yet, and I'll try to
>> answer your questions.
>>
>> >>>>>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
>> particular user in the underlying acl store (e.g. Active Directory)?
>> How does AD and/or LCF handle storing such data in its schema? (does AD
>> need its schema extended?)
>> Presumably, any such AD fields would need to be queried for effective
>> rights in order to cater for group membership allows and denies.
>> <<<<<<
>>
>> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
>> strings that represent a contract between an LCF authority connection and
>> the LCF repository connection that picks up the documents (from wherever).
>>  These tokens thus have no real meaning outside of LCF.  You must regard
>> them as opaque.
>>
>> The contract, however, states that if you use the LCF authority service to
>> obtain tokens for an authenticated user, you will get back a set that is
>> CONSISTENT with the tokens that were attached to the documents LCF sent to
>> Solr for indexing in the first place.  So, you don't have to worry about it,
>> and that's kind of the idea.  So you imagine the following flow:
>>
>> (1) Use LCF to fetch documents and send them to Solr
>> (2) When searching, use the LCF authority service to get the desired
>> user's access tokens
>> (3) Either filter the results, or modify the query, to be sure the access
>> tokens all match up properly
>>
>> For the AD authority, the LCF access tokens consist, in part, of the
>> user's SIDs.  For other authorities, the access tokens are wildly different.
>>  You really don't want to know what's in them, since that's the job of the
>> LCF authority to determine. ;-)
>>
>> LCF is not, by the way, joined at the hip with AD.  However, in practice,
>> most enterprises in the world use some form of AD single signon for their
>> web applications, and even if they're using some repository with its own
>> idea of security, there's a mapping between the AD users and the
>> repository's users.  Doing that mapping is also the job of the LCF authority
>> for that repository.
>>
>> Hope this helps.  Also, I'm not expecting time miracles here, so don't
>> sweat the schedule.
>>
>>
>> Karl
>>
>>
>> ________________________________________
>> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> Sent: Thursday, April 22, 2010 4:27 AM
>> To: dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> connectors-dev@incubator.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the quick turnaround.
>> I'm in the middle of a product release for us, so I fear I won't be as
>> quick as you... :-)
>>
>> I couldn't find a simple flow diagram or similar for LCF with regards
>> security (probably looking in the wrong place).
>> Perhaps you could help on these questions...?
>>
>> In SOLR-1872, the allows and denies are stored (in acl.xml) as
>> sub-queries, which are then used as filter queries in a user's search.
>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
>> particular user in the underlying acl store (e.g. Active Directory)?
>> How does AD and/or LCF handle storing such data in its schema? (does AD
>> need its schema extended?)
>> Presumably, any such AD fields would need to be queried for effective
>> rights in order to cater for group membership allows and denies.
>>
>> I guess I'm just trying to understand the architectural
>> flow/storage/retrieval of data in the various parts of the system, but I
>> admit, I need to do more research on this.
>> After our product release, when I get a few more spare cycles, I can look
>> at it in more detail.
>>
>> Many thanks!
>> Peter
>>
>>
>>
>> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I just committed the promised changes to the LCF Solr output connector.
>>
>> ACL metadata will now be posted to the Solr Http interface along with the
>> document as the two following fields:
>>
>> __ACCESS_TOKEN__document
>> __DENY_TOKEN__document
>>
>> There will, of course, potentially be multiple values for each of these
>> two fields.
>>
>> Hope this helps,
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 6:51 PM
>>
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the info. I'll have a look at the link and try to take in as
>> much sugar as my insulin levels will handle...
>> It sounds like the necessary interface(s) are already in LCF - just a
>> matter of implementing them in the Solr 1872 plugin.
>> I'll need to digest the LCF stuff to get to grips with it..please bear
>> with me while I do that...
>>
>> When you say:
>>   The LCF solr output connection doesn't yet do this, but it is trivial
>> for me to make that happen.
>> Do you mean a mechanism by which solr.war can get url et al info from its
>> parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>>
>>
>> Thanks,
>> Peter
>>
>>
>>
>>
>> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I'm the principal committer for LCF, but I don't know as much about Solr
>> as I ought to, so it sounds like a potentially productive collaboration.
>>
>> LCF does exactly what you are looking for - the only issue at all is that
>> you need to fetch a URL from a webapp to get what you are looking for.  The
>> "plugs" are all inside LCF for different kinds of repositories.  Here's a
>> link that might help with drinking the LCF "koolaid", as it were:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>>
>> The url would be something like this (on a locally installed tomcat-based
>> LCF instance):
>>
>>
>> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>>
>> ... and this fetch returns something like:
>>
>> TOKEN:xxxxxxx
>> TOKEN:yyyyyyy
>> TOKEN:zzzzzzz
>> ....
>>
>> ... which represent the amalgamated tokens for all of the defined
>> authorities, and by some strange coincidence ( ;-) ) are compatible with
>> certain pieces of metadata that have been passed into Solr with each
>> document - one set of Allow tokens, and a second set of Deny tokens.  The
>> LCF solr output connection doesn't yet do this, but it is trivial for me to
>> make that happen.
>>
>> Does this sound plausible to you?
>>
>> Karl
>>
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 5:41 PM
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
>> dev@lucene.apache.org>
>>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Integrating LCF to get external token support for SOLR-1872 sounds very
>> interesting indeed. I don't know anything about LCF, but one of the things I
>> was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
>> 'pluggable' - i.e. it would just be one of a series of plugins that could be
>> used for obtaining back-end authentication information.
>>
>> If you're good with LCF, perhaps we could work together to build this in.
>> One of the first things would be defining an interface that would be as easy
>> as possible to plug LCF into. Have you any suggestions/insight on this
>> front?
>>
>> Many thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> SOLR-1872 looks exactly like what I was envisioning, from the search query
>> perspective, although instead of the acl xml file you specify LCF stipulates
>> you would dynamically query the lcf-authority-service servlet for the access
>> tokens themselves.  That would get you support for AD, Documentum, LiveLink,
>> Meridio, and Memex for free. It seems likely that this component could be
>> modified to work with LCF with minor effort.
>>
>> The missing component still seems to be AD authentication, which needs a
>> solution.
>>
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 10:44 AM
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> If you want to do this completely within Solr, have a look at:
>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>>
>> Thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> FYI
>>
>> ________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 20, 2010 8:16 AM
>> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
>> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
>> connectors-dev@incubator.apache.org<mailto:
>> connectors-dev@incubator.apache.org>'; '
>> connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>'
>> Subject: RE: Solr and LCF security at query time
>>
>> Dominique,
>>
>> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes
>> a powerful multi-repository security model, even though it doesn't yet do
>> the final step of enforcing that model at the search end.  LCF allows you to
>> define multiple authorities to operate against disparate repositories, and
>> use the appropriate authority to secure any given document.  The solr people
>> are aware of this design, which addresses the issues raised by SOLR-1834
>> very nicely.  However, as I said before, time is a problem, and the work
>> still needs to be done.
>>
>> I suggest you read up on the actual security model of LCF, and perhaps
>> experiment with that and the SOLR-1834 contribution, to see if there is
>> common ground.  One thing we've learned at MetaCarta is that post-filtering
>> for security purposes is expensive, and it is better to modify the queries
>> themselves to restrict the results, if possible.  I'm not sure which
>> approach SOLR-1834 takes, although it sounds like it might be the filtering
>> approach.  Still, it would be better than nothing.
>>
>> Please let me know what you find out.
>>
>> Thanks,
>> Karl
>>
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
>> dominique.bejean@eolya.fr>]
>> Sent: Tuesday, April 20, 2010 8:03 AM
>> To: Wright Karl (Nokia-S/Cambridge)
>> Cc: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>;
>> connectors-dev@incubator.apache.org<mailto:
>> connectors-dev@incubator.apache.org>
>> Subject: Re: Solr and LCF security at query time
>>
>> Karl,
>>
>> Thank you for your reply.
>>
>> I did some research today and found this:
>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> http://demo.findwise.se:8880/SolrSecurity/
>>
>> The Solr security model has to be able to filter the result list with items
>> coming from various sources at the same time (livelink, documentum, file
>> system, ...). Big subject :)
>>
>> Dominique
>>
>>
>> Le 20/04/10 13:34, karl.wright@nokia.com<ma...@nokia.com> a
>> écrit :
>> Hi Dominique,
>>
>> At the moment, in order to enforce the LCF security model within
>> Lucene/Solr, you will need to build this functionality into whatever client
>> you are using to display the Lucene search results.  Specifically, you would
>> need to take the following steps:
>>
>> (1) Have your users access your search client through Apache.
>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
>> the client webapp.
>> (3) Have your client webapp alter whatever queries it is doing, to add an
>> appropriate query clause for each of the access tokens transmitted in the
>> headers.
>>
>> (This is how it is done at MetaCarta.)
>>
>> Alternatively, you may find a way to do this completely with a web
>> application under a Java app server such as Tomcat.  I have not yet done the
>> research to find out whether this is a feasible alternative.  Effectively,
>> what you need something like mod_auth_kerb to do is to authenticate your
>> user against Active Directory, or whomever the authenticator ought to be.
>>  JAAS may be helpful here.
>>
>> There are, of course, intentions to fill out the missing pieces more
>> completely and transparently via a Solr search plugin and/or filter.  What
>> has been lacking is time.  If you are in a position to do development in
>> this area, we're happy to have any assistance you might provide.
>>
>> Thanks,
>> Karl
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> Sent: Tuesday, April 20, 2010 5:06 AM
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>
>>  Subject: Solr and LCF security at query time
>>
>> Hi,
>>
>> I don't see in the LCF wiki how Solr and LCF work together at query time in
>> order to remove from the result list the items the user is not allowed to
>> access.
>>
>> In
>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> I just see these sentences :
>>
>> " Once all these documents and their access tokens are handed to the
>> search engine, it is the search engine's job to enforce security by
>> excluding inappropriate documents from the search results. For Lucene, this
>> infrastructure is expected to be built on top of Lucene's generic metadata
>> abilities, but has not been implemented at this time."
>>
>> I am not sure I understand. Does this mean that for the moment, it is not
>> possible for Solr to apply security by using an Authority Connector ?
>>
>> Dominique
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Thanks very much for your detailed explanation - really good!

As I've thought through some of the implications, I've added comments below,
so I hope they don't seem too jumbled...

I suppose on the 'authority' side, it works kind of as I envisioned it
would.

For general Solr access control, there are two layers of security that need to
be addressed:
  1. Authentication - make sure the incoming query is from a valid user, and
the passed-in credentials (hash, certificate etc.) are correct
  2. Query filtering - potentially reduce the number/type of returned
results based on the allow/deny metadata for the authenticated user

I can see how the LCF auth connector works for 2., but can it do 1. as well?
It would be good if this could somehow be integrated into any container
(Tomcat/Jetty et al) authentication that might be configured (probably
related to your previous post). In many ways, it could/should be that the
Authority (AD) part of the connector should only be concerned with 1. and
not 2. (see below).

So, on the repository side, there is also an LCF connector that 'closes the
loop' to provide the 'what is it I'm trying to control' side of things.
I understand that LCF doesn't do the mapping - it delegates this task to the
caller, but provides both sides of the equation (authority & repository).

>>>>>
- Each file in DirectoryA will have the following __ALLOW_TOKEN__document
attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
- Each file in DirectoryB will have the following __ALLOW_TOKEN__document
attributes inside Solr: "myAD:S-123-456-76890"
<<<<<
I think this is the bit that is worrying me - is this storing the SIDs into
Solr at document index time? This would be a problem for a whole load of
reasons, but maybe I'm missing something here? (see below for a possible
alternative)
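
For reference, the allow/deny rule from the quoted example can be written
down as a small check. This is only a sketch of the rule as Karl describes
it in the quoted message (a document is visible when the user's tokens
intersect its ALLOW tokens and do not intersect its DENY tokens); the
function name can_see is hypothetical and is not part of LCF or Solr. The
token values are the ones from the quoted example.

```python
def can_see(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if the user may see a document (hypothetical helper).

    Visible means: at least one user token is an ALLOW token of the
    document, and no user token is a DENY token of the document.
    """
    user = set(user_tokens)
    return bool(user & set(allow_tokens)) and not (user & set(deny_tokens))


# Tokens from the quoted example: Peter belongs to Group2 only.
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]
directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]  # Group1, Group2
directory_b = ["myAD:S-123-456-76890"]                        # Group1 only

print(can_see(peter, directory_a))  # files in DirectoryA
print(can_see(peter, directory_b))  # files in DirectoryB
```

With these values, Peter matches DirectoryA through the Group2 SID and has
no token in common with DirectoryB.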

Basically, what I'm getting at here is that the allow/deny values need to be
stored in one of three places:
  1. In the authority (e.g. inside AD)
  2. In the document metadata (index-time)
  3. In external storage (e.g. acl.xml, NTFS etc.)

1. Extending AD is pretty much out, as this causes too many interop problems
2. 'Hard-coding' acl information in the index makes it non-portable,
resistant to changes, etc.
3. acl.xml is coupled with a Solr instance, but is easily ported/replicated.
Storing/retrieving acl information from the source (e.g. NTFS) is
problematic, as the source may not be accessible (it may not even exist).

I believe 3. or a variant is the way to go on the repo side, which means the
LCF Authority connector is mainly for Authentication (see above), which is
what you want from AD et al integration.
The problem that arises from 'pluggable' authentication is that, if you're
not using a certificate, you have to start with a password, but the
connector only has access to the password hash (unless the pwd is sent in
the query url). I don't know of a way to confirm identities in AD using only
the username and hash (AD does the hash compare). I believe this is where
container-based integration will likely work better.

So that I can confirm my understanding...a scenario might be like this:

We have an AD connector that fetches the SIDs and we can read them etc.
For my environment, where there are no 'files' (there's only a transient
network stream), we have an LCF 'Solr Field Filter Query' connector that
decides which Filter Queries to apply (allow and deny) for the passed in
SID(s).
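
To illustrate the scenario above, here is a rough sketch of what such a
'Solr Field Filter Query' lookup might do. Everything in it is an
assumption made for illustration: the function name, the shape of the acl
table (SID mapped to allow/deny sub-queries, loosely in the spirit of
acl.xml) and the field names source, severity and visibility are all
hypothetical and are not taken from SOLR-1872 or LCF.

```python
def filter_queries_for(sids, acl):
    """Collect Solr filter queries for a list of SIDs (hypothetical sketch).

    acl maps a SID to {"allow": [...], "deny": [...]} sub-query strings.
    Allow sub-queries are OR'ed into one filter; each deny sub-query
    becomes its own negated filter.
    """
    allows, denies = [], []
    for sid in sids:
        entry = acl.get(sid, {})
        allows.extend(entry.get("allow", []))
        denies.extend(entry.get("deny", []))
    if not allows:
        # Nothing is allowed: return a filter intended to match no documents.
        return ["-*:*"]
    fqs = ["(" + " OR ".join("(%s)" % q for q in allows) + ")"]
    fqs.extend("-(%s)" % q for q in denies)
    return fqs


# Hypothetical acl table; SIDs reuse the values from the quoted example.
acl = {
    "S-1-1-0": {"allow": ["visibility:public"]},
    "S-23-64-12345": {"allow": ["source:firewall"], "deny": ["severity:debug"]},
}
sids = ["S-1-1-0", "S-323-999-12345", "S-23-64-12345"]
for fq in filter_queries_for(sids, acl):
    print(fq)
```

Each returned string would be passed as a separate fq parameter, so the
index itself carries no acl-specific fields in this variant.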

For another environment, let's say, NTFS, there might be an 'NTFS' connector
that would provide some kind of mapping of files/folders to SID(s). Since
Solr wouldn't intrinsically know about this, the acl information would need
to be stored somewhere in the index. This would mean extending the Solr
schema and storing metadata at index time.
The alternative is to re-use the 'Solr Field Filter Query' connector for
this as well (and any other document types that might be read in). This
keeps the index 'clean' of acl-specific metadata, and allows for in-place
changes and easy cross-document/index/instance access control.


If the above interpretation is [roughly] correct (please let me know if I've
got this wrong!), this would reduce down to having:
   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
(possibly/partly at the container level)
   2. At least an LCF Repository connector for 'acl.xml'
   3. Optional other LCF Repository connectors

It sounds like you've now finished the first half of 1. by adding the
ability to get the required auth data from a Solr api call. The other half
of 1. will be implementing the LCF interface in the SolrACLSecurity class,
to effectively replace the 'user', 'group' and 'password' bits of acl.xml.
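
As a starting point for that second half, the authority-service response
quoted further down in this thread is a list of lines of the form
TOKEN:xxxxxxx. A minimal sketch of turning such a response body into a
token list follows. The function name is hypothetical, and ignoring any
line that does not start with TOKEN: is an assumption, since the thread
only shows TOKEN: lines.

```python
def parse_user_acls(body):
    """Extract access tokens from a UserACLs-style response body.

    Assumes one 'TOKEN:<value>' per line, as shown in the thread; lines
    without that prefix are skipped (an assumption, not documented here).
    """
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


# Sample body built from the token values in the quoted example.
sample = "TOKEN:myAD:S-1-1-0\nTOKEN:myAD:S-323-999-12345\nTOKEN:myAD:S-23-64-12345\n"
print(parse_user_acls(sample))
```

Only the prefix is stripped, so a token that itself contains a colon (such
as the "myAD:" authority prefix) is kept whole.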

Does the above sound like an accurate interpretation? Just trying to get a
good picture of what work needs doing, where it goes, etc.

Many thanks!
Peter




On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:

>  >>>>>>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an access_token
> value might contain?)
> <<<<<<
>
> Documents have access/deny attributes; authorities simply provide the list
> of tokens that belong to an authenticated user.  Thus, there's no
> access/deny for an authority; that's attached to the document (as it is in
> real-world repositories).
>
> Let's run a quick example, using Active Directory and a Windows file
> system.  Suppose that you have a directory with documents in it, call it
> DirectoryA, and the directory allows read access to the following SIDs:
>
> S-123-456-76890
> S-23-64-12345
>
> These SIDs correspond to active directory groups, let's call them Group1
> and Group2, respectively.
>
> DirectoryB also has documents in it, and those documents have just the SID
> S-123-456-76890 attached, because only Group1 can read its contents.
>
> Now, pretend that someone has created an LCF Active Directory authority
> connection (in the LCF UI), which is called "myAD", and this connection is
> set up to talk to the governing AD domain controller for this Windows file
> system.  We now know enough to describe the document indexing process:
>
> - Each file in DirectoryA will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890"
>
> Now, suppose that a user (let's call him "Peter") is authenticated with the
> AD domain controller.  Peter belongs to Group2, so his SIDs are (say):
>
> S-1-1-0 (the 'everyone' SID)
> S-323-999-12345 (his own personal user SID)
> S-23-64-12345 (the SID he gets because he belongs to group 2)
>
> We want to look up the documents in the search index that he can see.  So,
> we ask the LCF authority service what his tokens are, and we get back:
>
> "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
>
> The documents we should return in his search are the ones matching his
> search criteria, PLUS the intersection of his tokens with the document ALLOW
> tokens, MINUS the intersection of his tokens with the document DENY tokens
> (there aren't any involved in this example).  So only files that have one
> of his three tokens as an ALLOW attribute would be returned.
>
> Note that what we are attempting to do is enforce AD's security with the
> search results we present.  There is no need to define a whole new security
> mechanism, because AD already has one that people use.
>
> >>>>>>
> One of the key requirements I've worked to adhere to in SOLR-1872 is to
> ensure there are no security or other dependencies of indexed data with any
> external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is that
> Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes from
> files.
> <<<<<<
>
> LCF is all about abstracting from repositories.  It's not specifically
> about a file system, although that is a convenient example.  If you are
> building your own kind of repository with your own security setup, that's
> fine - but in the LCF world you'd need to create an authority connector for
> your repository (which maybe reads your acl.xml file), as well as a
> repository connector (which hands documents to LCF and provides it with the
> access tokens that make security work).  Of course, you can do something much
> lighter that doesn't include LCF at all if you are just integrating a custom
> repository of your own, but it sounded like you were interested in the
> broader problem here.
>
> So, LCF doesn't do "acl mapping" at all.  It relies on its various
> connectors to work cooperatively to define access tokens in a way that is
> consistent from authority connector to repository connector for a given
> repository kind.  Anybody can write a connector, so the beauty of all this
> is that you can build a system where data from many disparate sources is
> indexed, and security for each is simultaneously enforced.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Thursday, April 22, 2010 9:24 AM
>
> *To:* dev@lucene.apache.org
> *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for the diagram -
> Sorry about all the questions, but this raises a few new ones...
>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an access_token
> value might contain?)
>
> One of the key requirements I've worked to adhere to in SOLR-1872 is to
> ensure there are no security or other dependencies of indexed data with any
> external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is that
> Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes from
> files.
>
> This is one reason why SOLR-1872 uses filter queries for its access/deny
> tokens - so that all the required information for access control completely
> resides within the Solr index itself.
> Is the LCF architecture acl 'mapping' between Solr fields (queries) and
> users, some external 'repository' (files) and users, or arbitrary data (e.g.
> either of these)?
>
> I hope that makes sense...
>
> Thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
>
>> Hi Peter,
>>
>> I've attached a diagram that is not in the wiki as of yet, and I'll try to
>> answer your questions.
>>
>> >>>>>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
>> particular user in the underlying acl store (e.g. Active Directory)?
>> How does AD and/or LCF handle storing such data in its schema? (does AD
>> need its schema extended?)
>> Presumably, any such AD fields would need to be queried for effective
>> rights in order to cater for group membership allows and denies.
>> <<<<<<
>>
>> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
>> strings that represent a contract between an LCF authority connection and
>> the LCF repository connection that picks up the documents (from wherever).
>>  These tokens thus have no real meaning outside of LCF.  You must regard
>> them as opaque.
>>
>> The contract, however, states that if you use the LCF authority service to
>> obtain tokens for an authenticated user, you will get back a set that is
>> CONSISTENT with the tokens that were attached to the documents LCF sent to
>> Solr for indexing in the first place.  So, you don't have to worry about it,
>> and that's kind of the idea.  So you imagine the following flow:
>>
>> (1) Use LCF to fetch documents and send them to Solr
>> (2) When searching, use the LCF authority service to get the desired
>> user's access tokens
>> (3) Either filter the results, or modify the query, to be sure the access
>> tokens all match up properly
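
Step (3), in its "modify the query" form, could be sketched roughly as below. This is only an illustration, not LCF or SOLR-1872 code: the helper names (quote_term, security_filters) are hypothetical, the field names are the two that the LCF Solr output connector is described as posting elsewhere in this thread, and it assumes those fields are indexed as untokenized strings so that a quoted term matches a whole token.

```python
# Field names as given in this thread for the LCF Solr output connector.
ALLOW_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def quote_term(token):
    """Wrap an opaque token in double quotes, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def security_filters(user_tokens):
    """Return two filter-query strings (hypothetical helper).

    The first keeps documents carrying at least one of the user's tokens
    as an allow token; the second drops documents carrying any of the
    user's tokens as a deny token.
    """
    if not user_tokens:
        # No tokens means no access; the caller should not run the search.
        raise ValueError("user has no access tokens")
    terms = " OR ".join(quote_term(t) for t in user_tokens)
    return [
        f"{ALLOW_FIELD}:({terms})",
        f"-{DENY_FIELD}:({terms})",
    ]
```

Each returned string would be sent as a separate fq parameter alongside the user's own query, so the restriction happens inside the search rather than as a post-filter on the results.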
>>
>> For the AD authority, the LCF access tokens consist, in part, of the
>> user's SIDs.  For other authorities, the access tokens are wildly different.
>>  You really don't want to know what's in them, since that's the job of the
>> LCF authority to determine. ;-)
>>
>> LCF is not, by the way, joined at the hip with AD.  However, in practice,
>> most enterprises in the world use some form of AD single signon for their
>> web applications, and even if they're using some repository with its own
>> idea of security, there's a mapping between the AD users and the
>> repository's users.  Doing that mapping is also the job of the LCF authority
>> for that repository.
>>
>> Hope this helps.  Also, I'm not expecting time miracles here, so don't
>> sweat the schedule.
>>
>>
>> Karl
>>
>>
>> ________________________________________
>> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> Sent: Thursday, April 22, 2010 4:27 AM
>> To: dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> connectors-dev@incubator.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the quick turnaround.
>> I'm in the middle of a product release for us, so I fear I won't be as
>> quick as you... :-)
>>
>> I couldn't find a simple flow diagram or similar for LCF with regard to
>> security (probably looking in the wrong place).
>> Perhaps you could help on these questions...?
>>
>> In SOLR-1872, the allows and denies are stored (in acl.xml) as
>> sub-queries, which are then used as filter queries in a user's search.
>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
>> particular user in the underlying acl store (e.g. Active Directory)?
>> How do AD and/or LCF handle storing such data in their schema? (does AD
>> need its schema extended?)
>> Presumably, any such AD fields would need to be queried for effective
>> rights in order to cater for group membership allows and denies.
>>
>> I guess I'm just trying to understand the architectural
>> flow/storage/retrieval of data in the various parts of the system, but I
>> admit, I need to do more research on this.
>> After our product release, when I get a few more spare cycles, I can look
>> at it in more detail.
>>
>> Many thanks!
>> Peter
>>
>>
>>
>> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I just committed the promised changes to the LCF Solr output connector.
>>
>> ACL metadata will now be posted to the Solr Http interface along with the
>> document as the two following fields:
>>
>> __ACCESS_TOKEN__document
>> __DENY_TOKEN__document
>>
>> There will, of course, potentially be multiple values for each of these
>> two fields.
>>
>> Hope this helps,
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 6:51 PM
>>
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the info. I'll have a look at the link and try to take in as
>> much sugar as my insulin levels will handle...
>> It sounds like the necessary interface(s) are already in LCF - just a
>> matter of implementing them in the Solr 1872 plugin.
>> I'll need to digest the LCF stuff to get to grips with it..please bear
>> with me while I do that...
>>
>> When you say:
>>   The LCF solr output connection doesn't yet do this, but it is trivial
>> for me to make that happen.
>> Do you mean a mechanism by which solr.war can get url et al info from its
>> parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>>
>>
>> Thanks,
>> Peter
>>
>>
>>
>>
>> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I'm the principal committer for LCF, but I don't know as much about Solr
>> as I ought to, so it sounds like a potentially productive collaboration.
>>
>> LCF does exactly what you are looking for - the only issue at all is that
>> you need to fetch a URL from a webapp to get what you are looking for.  The
>> "plugs" are all inside LCF for different kinds of repositories.  Here's a
>> link that might help with drinking the LCF "koolaid", as it were:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>>
>> The url would be something like this (on a locally installed tomcat-based
>> LCF instance):
>>
>>
>> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>>
>> ... and this fetch returns something like:
>>
>> TOKEN:xxxxxxx
>> TOKEN:yyyyyyy
>> TOKEN:zzzzzzz
>> ....
>>
>> ... which represent the amalgamated tokens for all of the defined
>> authorities, and by some strange coincidence ( ;-) ) are compatible with
>> certain pieces of metadata that have been passed into Solr with each
>> document - one set of Allow tokens, and a second set of Deny tokens.  The
>> LCF solr output connection doesn't yet do this, but it is trivial for me to
>> make that happen.
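
A minimal sketch of how a client might consume that response, assuming the body is plain text with one "TOKEN:<value>" entry per line as in the example above. The function names are hypothetical, and any line that does not start with "TOKEN:" is simply skipped, since the thread does not say what else the service may return.

```python
from urllib.parse import urlencode


def authority_url(base_url, username):
    """Build the UserACLs lookup URL for one user (username is URL-encoded)."""
    return base_url.rstrip("/") + "/UserACLs?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract token values from a response body with one 'TOKEN:<value>' per line."""
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # Keep everything after the prefix; tokens are opaque and may
            # themselves contain colons.
            tokens.append(line[len(prefix):])
    return tokens
```

The body itself would be fetched with any HTTP client, for example urllib.request.urlopen(authority_url(...)), and the parsed tokens are then the values to match against the allow and deny metadata stored with each document.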
>>
>> Does this sound plausible to you?
>>
>> Karl
>>
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 5:41 PM
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
>> dev@lucene.apache.org>
>>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Integrating LCF to get external token support for SOLR-1872 sounds very
>> interesting indeed. I don't know anything about LCF, but one of the things I
>> was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
>> 'pluggable' - i.e. it would just be one of a series of plugins that could be
>> used for obtaining back-end authentication information.
>>
>> If you're good with LCF, perhaps we could work together to build this in.
>> One of the first things would be defining an interface that would be as easy
>> as possible to plug LCF into. Have you any suggestions/insight on this
>> front?
>>
>> Many thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> SOLR-1872 looks exactly like what I was envisioning, from the search query
>> perspective, although instead of the acl xml file you specify, LCF stipulates
>> that you would dynamically query the lcf-authority-service servlet for the access
>> tokens themselves.  That would get you support for AD, Documentum, LiveLink,
>> Meridio, and Memex for free. It seems likely that this component could be
>> modified to work with LCF with minor effort.
>>
>> The missing component still seems to be AD authentication, which needs a
>> solution.
>>
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 10:44 AM
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> If you want to do this completely within Solr, have a look at:
>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>>
>> Thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> FYI
>>
>> ________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 20, 2010 8:16 AM
>> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
>> Cc: 'solr-dev@apache.org<ma...@apache.org>'; '
>> connectors-dev@incubator.apache.org<mailto:
>> connectors-dev@incubator.apache.org>'; '
>> connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>'
>> Subject: RE: Solr and LCF security at query time
>>
>> Dominique,
>>
>> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes
>> a powerful multi-repository security model, even though it doesn't yet do
>> the final step of enforcing that model at the search end.  LCF allows you to
>> define multiple authorities to operate against disparate repositories, and
>> use the appropriate authority to secure any given document.  The solr people
>> are aware of this design, which addresses the issues raised by SOLR-1834
>> very nicely.  However, as I said before, time is a problem, and the work
>> still needs to be done.
>>
>> I suggest you read up on the actual security model of LCF, and perhaps
>> experiment with that and the SOLR-1834 contribution, to see if there is
>> common ground.  One thing we've learned at MetaCarta is that post-filtering
>> for security purposes is expensive, and it is better to modify the queries
>> themselves to restrict the results, if possible.  I'm not sure which
>> approach SOLR-1834 takes, although it sounds like it might be the filtering
>> approach.  Still, it would be better than nothing.
>>
>> Please let me know what you find out.
>>
>> Thanks,
>> Karl
>>
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<mailto:
>> dominique.bejean@eolya.fr>]
>> Sent: Tuesday, April 20, 2010 8:03 AM
>> To: Wright Karl (Nokia-S/Cambridge)
>> Cc: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>;
>> connectors-dev@incubator.apache.org<mailto:
>> connectors-dev@incubator.apache.org>
>> Subject: Re: Solr and LCF security at query time
>>
>> Karl,
>>
>> Thank you for your reply.
>>
>> I did some research today and found this:
>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> http://demo.findwise.se:8880/SolrSecurity/
>>
>> Solr's security model has to be able to filter a result list containing items
>> that come from various sources at the same time (livelink, documentum, file
>> system, ...). Big subject :)
>>
>> Dominique
>>
>>
>> Le 20/04/10 13:34, karl.wright@nokia.com<ma...@nokia.com> a
>> écrit :
>> Hi Dominique,
>>
>> At the moment, in order to enforce the LCF security model within
>> Lucene/Solr, you will need to build this functionality into whatever client
>> you are using to display the Lucene search results.  Specifically, you would
>> need to take the following steps:
>>
>> (1) Have your users access your search client through Apache.
>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
>> the client webapp.
>> (3) Have your client webapp alter whatever queries it is doing, to add an
>> appropriate query clause for each of the access tokens transmitted in the
>> headers.
>>
>> (This is how it is done at MetaCarta.)
>>
>> Alternatively, you may find a way to do this completely with a web
>> application under a Java app server such as Tomcat.  I have not yet done the
>> research to find out whether this is a feasible alternative.  Effectively,
>> what you need something like mod_auth_kerb to do is to authenticate your
>> user against Active Directory, or whomever the authenticator ought to be.
>>  JAAS may be helpful here.
>>
>> There are, of course, intentions to fill out the missing pieces more
>> completely and transparently via a Solr search plugin and/or filter.  What
>> has been lacking is time.  If you are in a position to do development in
>> this area, we're happy to have any assistance you might provide.
>>
>> Thanks,
>> Karl
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> Sent: Tuesday, April 20, 2010 5:06 AM
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>
>>  Subject: Solr and LCF security at query time
>>
>> Hi,
>>
>> I don't see in the LCF wiki how Solr and LCF work together at query time in
>> order to remove from the result list the items the user is not allowed to
>> access.
>>
>> In
>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> I just see these sentences :
>>
>> " Once all these documents and their access tokens are handed to the
>> search engine, it is the search engine's job to enforce security by
>> excluding inappropriate documents from the search results. For Lucene, this
>> infrastructure is expected to be built on top of Lucene's generic metadata
>> abilities, but has not been implemented at this time."
>>
>> I am not sure I understand. Does this mean that, for the moment, it is not
>> possible for Solr to apply security by using an Authority Connector?
>>
>> Dominique
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Thanks very much for your detailed explanation - really good!

As I've thought through some of the implications, I've added comments below,
so I hope they don't seem too jumbled...

I suppose on the 'authority' side, it works kind of as I envisioned it
would.

For general Solr access control, there are two layers of security that need to
be addressed:
  1. Authentication - make sure the incoming query is from a valid user, and
the passed-in credentials (hash, certificate etc.) are correct
  2. Query filtering - potentially reduce the number/type of returned
results based on the allow/deny metadata for the authenticated user

I can see how the LCF auth connector works for 2., but can it do 1. as well?
It would be good if this could somehow be integrated into any container
(Tomcat/Jetty et al) authentication that might be configured (probably
related to your previous post). In many ways, it could/should be that the
Authority (AD) part of the connector should only be concerned with 1. and
not 2. (see below).

So, on the repository side, there is also an LCF connector that 'closes the
loop' to provide the 'what is it I'm trying to control' side of things.
I understand that LCF doesn't do the mapping - it delegates this task to the
caller, but provides both sides of the equation (authority & repository).

>>>>>
- Each file in DirectoryA will have the following __ALLOW_TOKEN__document
attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
- Each file in DirectoryB will have the following __ALLOW_TOKEN__document
attributes inside Solr: "myAD:S-123-456-76890"
<<<<<
I think this is the bit that is worrying me - is this storing the SIDs into
Solr at document index time? This would be a problem for a whole load of
reasons, but maybe I'm missing something here? (see below for a possible
alternative)

Basically, what I'm getting at here is that the allow/deny values need to be
stored in one of three places:
  1. In the authority (e.g. inside AD)
  2. In the document metadata (index-time)
  3. In external storage (e.g. acl.xml, NTFS etc.)

1. Extending AD is pretty much out, as this causes too many interop problems
2. 'Hard-coding' acl information in the index makes it non-portable,
resistant to changes, etc.
3. acl.xml is coupled with a Solr instance, but is easily ported/replicated.
Storing/retrieving acl information from the source (e.g. NTFS) is
problematic, as the source may not be accessible (it may not even exist).

I believe 3. or a variant is the way to go on the repo side, which means the
LCF Authority connector is mainly for Authentication (see above), which is
what you want from AD et al integration.
The problem that arises from 'pluggable' authentication is that, if you're
not using a certificate, you have to start with a password, but the
connector only has access to the password hash (unless the pwd is sent in
the query url). I don't know of a way to confirm identities in AD using only
the username and hash (AD does the hash compare). I believe this is where
container-based integration will likely work better.

So that I can confirm my understanding...a scenario might be like this:

We have an AD connector that fetches the SIDs and we can read them etc.
For my environment, where there are no 'files' (there's only a transient
network stream), we have an LCF 'Solr Field Filter Query' connector that
decides which Filter Queries to apply (allow and deny) for the passed in
SID(s).

For another environment, let's say, NTFS, there might be an 'NTFS' connector
that would provide some kind of mapping of files/folders to SID(s). Since
Solr wouldn't intrinsically know about this, the acl information would need
to be stored somewhere in the index. This would mean extending the Solr
schema and storing metadata at index time.
The alternative is to re-use the 'Solr Field Filter Query' connector for
this as well (and any other document types that might be read in). This
keeps the index 'clean' of acl-specific metadata, and allows for in-place
changes and easy cross-document/index/instance access control.


If the above interpretation is [roughly] correct (please let me know if I've
got this wrong!), this would reduce down to having:
   1. One or more LCF Authority connectors (e.g. AD, Documentum, etc.)
(possibly/partly at the container level)
   2. At least an LCF Repository connector for 'acl.xml'
   3. Optional other LCF Repository connectors

It sounds like you've now finished the first half of 1. by adding the
ability to get the required auth data from a Solr api call. The other half
of 1. will be implementing the LCF interface in the SolrACLSecurity class,
to effectively replace the 'user', 'group' and 'password' bits of acl.xml.

Does the above sound like an accurate interpretation? Just trying to get a
good picture of what work needs doing, where it goes, etc.

Many thanks!
Peter




On Thu, Apr 22, 2010 at 2:52 PM, <ka...@nokia.com> wrote:

>  >>>>>>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an access_token
> value might contain?)
> <<<<<<
>
> Documents have access/deny attributes; authorities simply provide the list
> of tokens that belong to an authenticated user.  Thus, there's no
> access/deny for an authority; that's attached to the document (as it is in
> real-world repositories).
>
> Let's run a quick example, using Active Directory and a Windows file
> system.  Suppose that you have a directory with documents in it, call it
> DirectoryA, and the directory allows read access to the following SIDs:
>
> S-123-456-76890
> S-23-64-12345
>
> These SIDs correspond to active directory groups, let's call them Group1
> and Group2, respectively.
>
> DirectoryB also has documents in it, and those documents have just the SID
> S-123-456-76890 attached, because only Group1 can read its contents.
>
> Now, pretend that someone has created an LCF Active Directory authority
> connection (in the LCF UI), which is called "myAD", and this connection is
> set up to talk to the governing AD domain controller for this Windows file
> system.  We now know enough to describe the document indexing process:
>
> - Each file in DirectoryA will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
> - Each file in DirectoryB will have the following __ALLOW_TOKEN__document
> attributes inside Solr: "myAD:S-123-456-76890"
>
> Now, suppose that a user (let's call him "Peter") is authenticated with the
> AD domain controller.  Peter belongs to Group2, so his SIDs are (say):
>
> S-1-1-0 (the 'everyone' SID)
> S-323-999-12345 (his own personal user SID)
> S-23-64-12345 (the SID he gets because he belongs to group 2)
>
> We want to look up the documents in the search index that he can see.  So,
> we ask the LCF authority service what his tokens are, and we get back:
>
> "myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"
>
> The documents we should return in his search are the ones that match his
> search criteria AND carry at least one of his tokens as an ALLOW token, AND
> do NOT carry any of his tokens as a DENY token (there aren't any DENY tokens
> involved in this example).  So only files that have one of his three tokens
> as an ALLOW attribute would be returned: here, the files in DirectoryA but
> not those in DirectoryB.
>
> Note that what we are attempting to do is enforce AD's security with the
> search results we present.  There is no need to define a whole new security
> mechanism, because AD already has one that people use.
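
The rule in this example can be written down as a small set check. The sketch below is illustrative only: can_see is a hypothetical name rather than part of LCF or Solr, and the token values are the ones from the DirectoryA/DirectoryB example above.

```python
def can_see(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if a document should be visible to the user.

    Visible means: at least one of the user's tokens is among the
    document's allow tokens, and none of them is among its deny tokens.
    """
    user = set(user_tokens)
    return bool(user & set(allow_tokens)) and not (user & set(deny_tokens))


# Tokens from the example: Peter belongs to Group2 only.
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]
directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
directory_b = ["myAD:S-123-456-76890"]

print(can_see(peter, directory_a))  # True: the Group2 SID is an allow token
print(can_see(peter, directory_b))  # False: only Group1 is allowed
```

A deny token always wins in this sketch: if one of Peter's tokens appeared in a document's deny list, the document would be hidden even though an allow token also matched.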
>
> >>>>>>
> One of the key requirements I've worked to adhere to in SOLR-1872 is to
> ensure there are no security or other dependencies of indexed data with any
> external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is that
> Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes from
> files.
> <<<<<<
>
> LCF is all about abstracting from repositories.  It's not specifically
> about a file system, although that is a convenient example.  If you are
> building your own kind of repository with your own security setup, that's
> fine - but in the LCF world you'd need to create an authority connector for
> your repository (which maybe reads your acl.xml file), as well as a
> repository connector (which hands documents to LCF and provides it with the
> access tokens that make security work).  Of course, you can do something much
> lighter that doesn't include LCF at all if you are just integrating a custom
> repository of your own, but it sounded like you were interested in the
> broader problem here.
>
> So, LCF doesn't do "acl mapping" at all.  It relies on its various
> connectors to work cooperatively to define access tokens in a way that is
> consistent from authority connector to repository connector for a given
> repository kind.  Anybody can write a connector, so the beauty of all this
> is that you can build a system where data from many disparate sources is
> indexed, and security for each is simultaneously enforced.
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Thursday, April 22, 2010 9:24 AM
>
> *To:* dev@lucene.apache.org
> *Cc:* connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks very much for the diagram -
> Sorry about all the questions, but this raises a few new ones...
>
> What is the relationship between stored data (documents) and authorities'
> access/deny attributes? (do you have any examples of what an access_token
> value might contain?)
>
> One of the key requirements I've worked to adhere to in SOLR-1872 is to
> ensure there are no security or other dependencies of indexed data with any
> external repository - most notably the file system.
> There are many reasons for wanting this, but one of the main ones is that
> Solr-stored data is not always based on file data (or accessible file data).
> In fact, in my particular case, almost none of the indexed data comes from
> files.
>
> This is one reason why SOLR-1872 uses filter queries for its access/deny
> tokens - so that all the required information for access control completely
> resides within the Solr index itself.
> Is the LCF architecture acl 'mapping' between Solr fields (queries) and
> users, some external 'repository' (files) and users, or arbitrary data (e.g.
> either of these)?
>
> I hope that makes sense...
>
> Thanks!
> Peter
>
>
>
>
> On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
>
>> Hi Peter,
>>
>> I've attached a diagram that is not in the wiki as of yet, and I'll try to
>> answer your questions.
>>
>> >>>>>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
>> particular user in the underlying acl store (e.g. Active Directory)?
>> How do AD and/or LCF handle storing such data in their schema? (does AD
>> need its schema extended?)
>> Presumably, any such AD fields would need to be queried for effective
>> rights in order to cater for group membership allows and denies.
>> <<<<<<
>>
>> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary
>> strings that represent a contract between an LCF authority connection and
>> the LCF repository connection that picks up the documents (from wherever).
>>  These tokens thus have no real meaning outside of LCF.  You must regard
>> them as opaque.
>>
>> The contract, however, states that if you use the LCF authority service to
>> obtain tokens for an authenticated user, you will get back a set that is
>> CONSISTENT with the tokens that were attached to the documents LCF sent to
>> Solr for indexing in the first place.  So, you don't have to worry about it,
>> and that's kind of the idea.  So you imagine the following flow:
>>
>> (1) Use LCF to fetch documents and send them to Solr
>> (2) When searching, use the LCF authority service to get the desired
>> user's access tokens
>> (3) Either filter the results, or modify the query, to be sure the access
>> tokens all match up properly
>>
>> For the AD authority, the LCF access tokens consist, in part, of the
>> user's SIDs.  For other authorities, the access tokens are wildly different.
>>  You really don't want to know what's in them, since that's the job of the
>> LCF authority to determine. ;-)
>>
>> LCF is not, by the way, joined at the hip with AD.  However, in practice,
>> most enterprises in the world use some form of AD single signon for their
>> web applications, and even if they're using some repository with its own
>> idea of security, there's a mapping between the AD users and the
>> repository's users.  Doing that mapping is also the job of the LCF authority
>> for that repository.
>>
>> Hope this helps.  Also, I'm not expecting time miracles here, so don't
>> sweat the schedule.
>>
>>
>> Karl
>>
>>
>> ________________________________________
>> From: ext Peter Sturge [peter.sturge@googlemail.com]
>> Sent: Thursday, April 22, 2010 4:27 AM
>> To: dev@lucene.apache.org
>> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
>> connectors-dev@incubator.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the quick turnaround.
>> I'm in the middle of a product release for us, so I fear I won't be as
>> quick as you... :-)
>>
>> I couldn't find a simple flow diagram or similar for LCF with regard to
>> security (probably looking in the wrong place).
>> Perhaps you could help on these questions...?
>>
>> In SOLR-1872, the allows and denies are stored (in acl.xml) as
>> sub-queries, which are then used as filter queries in a user's search.
>>
>> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
>> particular user in the underlying acl store (e.g. Active Directory)?
>> How do AD and/or LCF handle storing such data in their schema? (does AD
>> need its schema extended?)
>> Presumably, any such AD fields would need to be queried for effective
>> rights in order to cater for group membership allows and denies.
>>
>> I guess I'm just trying to understand the architectural
>> flow/storage/retrieval of data in the various parts of the system, but I
>> admit, I need to do more research on this.
>> After our product release, when I get a few more spare cycles, I can look
>> at it in more detail.
>>
>> Many thanks!
>> Peter
>>
>>
>>
>> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I just committed the promised changes to the LCF Solr output connector.
>>
>> ACL metadata will now be posted to the Solr Http interface along with the
>> document as the two following fields:
>>
>> __ACCESS_TOKEN__document
>> __DENY_TOKEN__document
>>
>> There will, of course, potentially be multiple values for each of these
>> two fields.
>>
>> Hope this helps,
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 6:51 PM
>>
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Thanks for the info. I'll have a look at the link and try to take in as
>> much sugar as my insulin levels will handle...
>> It sounds like the necessary interface(s) are already in LCF - just a
>> matter of implementing them in the Solr 1872 plugin.
>> I'll need to digest the LCF stuff to get to grips with it..please bear
>> with me while I do that...
>>
>> When you say:
>>   The LCF solr output connection doesn't yet do this, but it is trivial
>> for me to make that happen.
>> Do you mean a mechanism by which solr.war can get url et al info from its
>> parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>>
>>
>> Thanks,
>> Peter
>>
>>
>>
>>
>> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com<mailto:
>> karl.wright@nokia.com>> wrote:
>> Hi Peter,
>>
>> I'm the principal committer for LCF, but I don't know as much about Solr
>> as I ought to, so it sounds like a potentially productive collaboration.
>>
>> LCF does exactly what you are looking for - the only issue at all is that
>> you need to fetch a URL from a webapp to get what you are looking for.  The
>> "plugs" are all inside LCF for different kinds of repositories.  Here's a
>> link that might help with drinking the LCF "koolaid", as it were:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>>
>> The url would be something like this (on a locally installed tomcat-based
>> LCF instance):
>>
>>
>> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>>
>> ... and this fetch returns something like:
>>
>> TOKEN:xxxxxxx
>> TOKEN:yyyyyyy
>> TOKEN:zzzzzzz
>> ....
>>
>> ... which represent the amalgamated tokens for all of the defined
>> authorities, and by some strange coincidence ( ;-) ) are compatible with
>> certain pieces of metadata that have been passed into Solr with each
>> document - one set of Allow tokens, and a second set of Deny tokens.  The
>> LCF solr output connection doesn't yet do this, but it is trivial for me to
>> make that happen.
>>
>> Does this sound plausible to you?
>>
>> Karl
>>
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<mailto:
>> peter.sturge@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 5:41 PM
>> To: connectors-user@incubator.apache.org<mailto:
>> connectors-user@incubator.apache.org>; dev@lucene.apache.org<mailto:
>> dev@lucene.apache.org>
>>
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> Hi Karl,
>>
>> Integrating LCF to get external token support for SOLR-1872 sounds very
>> interesting indeed. I don't know anything about LCF, but one of the things I
>> was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
>> 'pluggable' - i.e. it would just be one of a series of plugins that could be
>> used for obtaining back-end authentication information.
>>
>> If you're good with LCF, perhaps we could work together to build this in.
>> One of the first things would be defining an interface that would be as easy
>> as possible to plug LCF into. Have you any suggestions/insight on this
>> front?
>>
>> Many thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
>> SOLR-1872 looks exactly like what I was envisioning, from the search query
>> perspective, although instead of reading the acl.xml file you specify, LCF
>> stipulates that you dynamically query the lcf-authority-service servlet for
>> the access tokens themselves.  That would get you support for AD,
>> Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this
>> component could be modified to work with LCF with minor effort.
>>
>> The missing component still seems to be AD authentication, which needs a
>> solution.
>>
>> Karl
>>
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> Sent: Tuesday, April 20, 2010 10:44 AM
>> To: dev@lucene.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>>
>> If you want to do this completely within Solr, have a look at:
>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>>
>> Thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
>> FYI
>>
>> ________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 20, 2010 8:16 AM
>> To: 'dominique.bejean@eolya.fr'
>> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org';
>> 'connectors-user@incubator.apache.org'
>> Subject: RE: Solr and LCF security at query time
>>
>> Dominique,
>>
>> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes
>> a powerful multi-repository security model, even though it doesn't yet do
>> the final step of enforcing that model at the search end.  LCF allows you to
>> define multiple authorities to operate against disparate repositories, and
>> use the appropriate authority to secure any given document.  The solr people
>> are aware of this design, which addresses the issues raised by SOLR-1834
>> very nicely.  However, as I said before, time is a problem, and the work
>> still needs to be done.
>>
>> I suggest you read up on the actual security model of LCF, and perhaps
>> experiment with that and the SOLR-1834 contribution, to see if there is
>> common ground.  One thing we've learned at MetaCarta is that post-filtering
>> for security purposes is expensive, and it is better to modify the queries
>> themselves to restrict the results, if possible.  I'm not sure which
>> approach SOLR-1834 takes, although it sounds like it might be the filtering
>> approach.  Still, it would be better than nothing.
>>
>> Please let me know what you find out.
>>
>> Thanks,
>> Karl
>>
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> Sent: Tuesday, April 20, 2010 8:03 AM
>> To: Wright Karl (Nokia-S/Cambridge)
>> Cc: connectors-user@incubator.apache.org;
>> connectors-dev@incubator.apache.org
>> Subject: Re: Solr and LCF security at query time
>>
>> Karl,
>>
>> Thank you for your reply.
>>
>> I did some research today and found this:
>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> http://demo.findwise.se:8880/SolrSecurity/
>>
>> The Solr security model has to be able to filter a result list containing
>> items that come from various sources at the same time (Livelink,
>> Documentum, file system, ...). Big subject :)
>>
>> Dominique
>>
>>
>> On 20/04/10 13:34, karl.wright@nokia.com wrote:
>> Hi Dominique,
>>
>> At the moment, in order to enforce the LCF security model within
>> Lucene/Solr, you will need to build this functionality into whatever client
>> you are using to display the Lucene search results.  Specifically, you would
>> need to take the following steps:
>>
>> (1) Have your users access your search client through Apache.
>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
>> the client webapp.
>> (3) Have your client webapp alter whatever queries it is doing, to add an
>> appropriate query clause for each of the access tokens transmitted in the
>> headers.
>>
>> (This is how it is done at MetaCarta.)
>>
>> Alternatively, you may find a way to do this completely with a web
>> application under a Java app server such as Tomcat.  I have not yet done the
>> research to find out whether this is a feasible alternative.  Effectively,
>> what you need something like mod_auth_kerb to do is authenticate your
>> user against Active Directory, or whatever the authenticator ought to be.
>> JAAS may be helpful here.
>>
>> There are, of course, intentions to fill out the missing pieces more
>> completely and transparently via a Solr search plugin and/or filter.  What
>> has been lacking is time.  If you are in a position to do development in
>> this area, we're happy to have any assistance you might provide.
>>
>> Thanks,
>> Karl
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> Sent: Tuesday, April 20, 2010 5:06 AM
>> To: connectors-user@incubator.apache.org
>> Subject: Solr and LCF security at query time
>>
>> Hi,
>>
>> I don't see in the LCF wiki how Solr and LCF work together at query time in
>> order to remove from the result list the items the user is not allowed to
>> access.
>>
>> In
>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> I just see these sentences :
>>
>> " Once all these documents and their access tokens are handed to the
>> search engine, it is the search engine's job to enforce security by
>> excluding inappropriate documents from the search results. For Lucene, this
>> infrastructure is expected to be built on top of Lucene's generic metadata
>> abilities, but has not been implemented at this time."
>>
>> I am not sure I understand. Does this mean that for the moment, it is not
>> possible for Solr to apply security by using an Authority Connector?
>>
>> Dominique
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
>>>>>>
What is the relationship between stored data (documents) and authorities' access/deny attributes? (do you have any examples of what an access_token value might contain?)
<<<<<<

Documents have access/deny attributes; authorities simply provide the list of tokens that belong to an authenticated user.  Thus, there's no access/deny for an authority; that's attached to the document (as it is in real-world repositories).

Let's run a quick example, using Active Directory and a Windows file system.  Suppose that you have a directory with documents in it, call it DirectoryA, and the directory allows read access to the following SIDs:

S-123-456-76890
S-23-64-12345

These SIDs correspond to active directory groups, let's call them Group1 and Group2, respectively.

DirectoryB also has documents in it, and those documents have just the SID S-123-456-76890 attached, because only Group1 can read its contents.

Now, pretend that someone has created an LCF Active Directory authority connection (in the LCF UI), which is called "myAD", and this connection is set up to talk to the governing AD domain controller for this Windows file system.  We now know enough to describe the document indexing process:

- Each file in DirectoryA will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
- Each file in DirectoryB will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"

Now, suppose that a user (let's call him "Peter") is authenticated with the AD domain controller.  Peter belongs to Group2, so his SIDs are (say):

S-1-1-0 (the 'everyone' SID)
S-323-999-12345 (his own personal user SID)
S-23-64-12345 (the SID he gets because he belongs to group 2)

We want to look up the documents in the search index that he can see.  So, we ask the LCF authority service what his tokens are, and we get back:

"myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"

The documents we should return in his search are the ones that match his search criteria AND carry at least one ALLOW token that is also one of his tokens, EXCLUDING any document that carries a DENY token that is also one of his tokens (there aren't any DENY tokens involved in this example).  So only files that have one of his three tokens as an ALLOW attribute would be returned; here, that means the files in DirectoryA but not those in DirectoryB.
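
As an illustration only (this is not LCF code), the rule above can be sketched in Python. The function name is_visible is made up for this sketch; the token values are the ones from the example.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if a document may be shown to the user.

    A document is visible when it shares at least one ALLOW token with
    the user and shares no DENY token with the user.
    """
    user = set(user_tokens)
    allowed = bool(user & set(allow_tokens))
    denied = bool(user & set(deny_tokens))
    return allowed and not denied


# Tokens the authority service returns for Peter in the example.
peter = {"myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"}

# ALLOW tokens attached to the files of each directory at indexing time.
directory_a = {"myAD:S-123-456-76890", "myAD:S-23-64-12345"}
directory_b = {"myAD:S-123-456-76890"}

print(is_visible(peter, directory_a))  # True: Group2's SID matches
print(is_visible(peter, directory_b))  # False: only Group1 may read
```

The search criteria themselves are left out here; in a real system this check would be combined with the user's query rather than applied to every document.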

Note that what we are attempting to do is enforce AD's security with the search results we present.  There is no need to define a whole new security mechanism, because AD already has one that people use.

>>>>>>
One of the key requirements I've worked to adhere to in SOLR-1872 is to ensure there are no security or other dependencies of indexed data with any external repository - most notably the file system.
There are many reasons for wanting this, but one of the main ones is that Solr-stored data is not always based on file data (or accessible file data). In fact, in my particular case, almost none of the indexed data comes from files.
<<<<<<

LCF is all about abstracting from repositories.  It's not specifically about a file system, although that is a convenient example.  If you are building your own kind of repository with your own security setup, that's fine - but in the LCF world you'd need to create an authority connector for your repository (which maybe reads your acl.xml file), as well as a repository connector (which hands documents to LCF and provides it with the access tokens that make security work).  Of course, you can do something much lighter that doesn't include LCF at all if you are just integrating a custom repository of your own, but it sounded like you were interested in the broader problem here.

So, LCF doesn't do "acl mapping" at all.  It relies on its various connectors to work cooperatively to define access tokens in a way that is consistent from authority connector to repository connector for a given repository kind.  Anybody can write a connector, so the beauty of all this is that you can build a system where data from many disparate sources is indexed, and security for each is simultaneously enforced.

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 9:24 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks very much for the diagram -
Sorry about all the questions, but this raises a few new ones...

What is the relationship between stored data (documents) and authorities' access/deny attributes? (do you have any examples of what an access_token value might contain?)

One of the key requirements I've worked to adhere to in SOLR-1872 is to ensure there are no security or other dependencies of indexed data with any external repository - most notably the file system.
There are many reasons for wanting this, but one of the main ones is that Solr-stored data is not always based on file data (or accessible file data). In fact, in my particular case, almost none of the indexed data comes from files.

This is one reason why SOLR-1872 uses filter queries for its access/deny tokens - so that all the required information for access control completely resides within the Solr index itself.
Is the LCF architecture acl 'mapping' between Solr fields (queries) and users, some external 'repository' (files) and users, or arbitrary data (e.g. either of these)?

I hope that makes sense...

Thanks!
Peter




On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
Hi Peter,

I've attached a diagram that is not in the wiki as of yet, and I'll try to answer your questions.

>>>>>>
Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (does AD need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.
<<<<<<

The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings that represent a contract between an LCF authority connection and the LCF repository connection that picks up the documents (from wherever).  These tokens thus have no real meaning outside of LCF.  You must regard them as opaque.

The contract, however, states that if you use the LCF authority service to obtain tokens for an authenticated user, you will get back a set that is CONSISTENT with the tokens that were attached to the documents LCF sent to Solr for indexing in the first place.  So, you don't have to worry about it, and that's kind of the idea.  So you imagine the following flow:

(1) Use LCF to fetch documents and send them to Solr
(2) When searching, use the LCF authority service to get the desired user's access tokens
(3) Either filter the results, or modify the query, to be sure the access tokens all match up properly
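
A minimal sketch of step (3), taking the query-modification route: it turns a user's tokens into a single filter-query string. It assumes the standard Lucene query syntax and the two field names given elsewhere in this thread (__ACCESS_TOKEN__document and __DENY_TOKEN__document); the helper names quote and build_security_filter are hypothetical.

```python
def quote(token):
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_security_filter(user_tokens,
                          allow_field="__ACCESS_TOKEN__document",
                          deny_field="__DENY_TOKEN__document"):
    """Build a filter clause from the user's access tokens.

    Returns None when the user has no tokens; the caller should then
    return no documents at all rather than run an unrestricted query.
    """
    tokens = sorted(set(user_tokens))
    if not tokens:
        return None
    clause = " OR ".join(quote(t) for t in tokens)
    # Require a matching ALLOW token and prohibit any matching DENY token.
    return f"+{allow_field}:({clause}) -{deny_field}:({clause})"


print(build_security_filter(["myAD:S-1-1-0", "myAD:S-23-64-12345"]))
```

The tokens are treated as opaque strings and only quoted, never interpreted, which matches the contract described above.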

For the AD authority, the LCF access tokens consist, in part, of the user's SIDs.  For other authorities, the access tokens are wildly different.  You really don't want to know what's in them, since that's the job of the LCF authority to determine. ;-)

LCF is not, by the way, joined at the hip with AD.  However, in practice, most enterprises in the world use some form of AD single signon for their web applications, and even if they're using some repository with its own idea of security, there's a mapping between the AD users and the repository's users.  Doing that mapping is also the job of the LCF authority for that repository.

Hope this helps.  Also, I'm not expecting time miracles here, so don't sweat the schedule.


Karl


________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 4:27 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the quick turnaround.
I'm in the middle of a product release for us, so I fear I won't be as quick as you... :-)

I couldn't find a simple flow diagram or similar for LCF with regard to security (probably looking in the wrong place).
Perhaps you could help on these questions...?

In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries, which are then used as filter queries in a user's search.

Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (does AD need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.

I guess I'm just trying to understand the architectural flow/storage/retrieval of data in the various parts of the system, but I admit, I need to do more research on this.
After our product release, when I get a few more spare cycles, I can look at it in more detail.

Many thanks!
Peter



On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com> wrote:
Hi Peter,

I just committed the promised changes to the LCF Solr output connector.

ACL metadata will now be posted to the Solr Http interface along with the document as the two following fields:

__ACCESS_TOKEN__document
__DENY_TOKEN__document

There will, of course, potentially be multiple values for each of these two fields.
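
To picture what "multiple values for each of these two fields" looks like, here is a small sketch that builds a Solr XML update message with repeated field elements. This only illustrates multi-valued fields; the thread does not say how the LCF connector actually formats its HTTP posts, and the "id" field and the function name build_add_message are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc_id, allow_tokens, deny_tokens):
    """Build an <add><doc>...</doc></add> message with one <field> per token."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" is an assumed unique-key field name, not taken from the thread.
    ET.SubElement(doc, "field", name="id").text = doc_id
    for token in allow_tokens:
        ET.SubElement(doc, "field", name="__ACCESS_TOKEN__document").text = token
    for token in deny_tokens:
        ET.SubElement(doc, "field", name="__DENY_TOKEN__document").text = token
    return ET.tostring(add, encoding="unicode")


print(build_add_message("file-1",
                        ["myAD:S-123-456-76890", "myAD:S-23-64-12345"],
                        []))
```

For this to index as intended, both fields would presumably need to be declared multi-valued in the Solr schema.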

Hope this helps,
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 6:51 PM

To: connectors-user@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the Solr 1872 plugin.
I'll need to digest the LCF stuff to get to grips with it..please bear with me while I do that...

When you say:
  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The url would be something like this (on a locally installed tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
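
A rough sketch of the client side of this exchange, assuming the response really is plain text with one "TOKEN:" line per token as shown above (the message only says "something like"). The function names authority_url and parse_tokens are hypothetical, and any other line types the service may return are simply ignored here.

```python
from urllib.parse import urlencode

# Base URL taken from the example above (local Tomcat-based LCF instance).
DEFAULT_BASE = "http://localhost:8080/lcf-authority-service/UserACLs"


def authority_url(username, base=DEFAULT_BASE):
    """Build the lookup URL, percent-encoding the user name."""
    return base + "?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract the token values from the lines that start with 'TOKEN:'."""
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


print(authority_url("someusername@somedomain.com"))
print(parse_tokens("TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"))
```

The actual HTTP fetch is left out so the sketch stays runnable without an LCF instance; the body string would come from whatever HTTP client the search component uses.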

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org

Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of reading the acl.xml file you specify, LCF stipulates that you dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list containing items that come from various sources at the same time (Livelink, Documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is authenticate your user against Active Directory, or whatever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences :

" Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique








RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
>>>>>>
What is the relationship between stored data (documents) and authorities' access/deny attributes? (do you have any examples of what an access_token value might contain?)
<<<<<<

Documents have access/deny attributes; authorities simply provide the list of tokens that belong to an authenticated user.  Thus, there's no access/deny for an authority; that's attached to the document (as it is in real-world repositories).

Let's run a quick example, using Active Directory and a Windows file system.  Suppose that you have a directory with documents in it, call it DirectoryA, and the directory allows read access to the following SIDs:

S-123-456-76890
S-23-64-12345

These SIDs correspond to active directory groups, let's call them Group1 and Group2, respectively.

DirectoryB also has documents in it, and those documents have just the SID S-123-456-76890 attached, because only Group1 can read its contents.

Now, pretend that someone has created an LCF Active Directory authority connection (in the LCF UI), which is called "myAD", and this connection is set up to talk to the governing AD domain controller for this Windows file system.  We now know enough to describe the document indexing process:

- Each file in DirectoryA will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
- Each file in DirectoryB will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"

Now, suppose that a user (let's call him "Peter") is authenticated with the AD domain controller.  Peter belongs to Group2, so his SIDs are (say):

S-1-1-0 (the 'everyone' SID)
S-323-999-12345 (his own personal user SID)
S-23-64-12345 (the SID he gets because he belongs to group 2)

We want to look up the documents in the search index that he can see.  So, we ask the LCF authority service what his tokens are, and we get back:

"myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"

The documents we should return in his search are the ones that match his search criteria AND carry at least one ALLOW token that is also one of his tokens, EXCLUDING any document that carries a DENY token that is also one of his tokens (there aren't any DENY tokens involved in this example).  So only files that have one of his three tokens as an ALLOW attribute would be returned; here, that means the files in DirectoryA but not those in DirectoryB.

Note that what we are attempting to do is enforce AD's security with the search results we present.  There is no need to define a whole new security mechanism, because AD already has one that people use.

>>>>>>
One of the key requirements I've worked to adhere to in SOLR-1872 is to ensure there are no security or other dependencies of indexed data with any external repository - most notably the file system.
There are many reasons for wanting this, but one of the main ones is that Solr-stored data is not always based on file data (or accessible file data). In fact, in my particular case, almost none of the indexed data comes from files.
<<<<<<

LCF is all about abstracting from repositories.  It's not specifically about a file system, although that is a convenient example.  If you are building your own kind of repository with your own security setup, that's fine - but in the LCF world you'd need to create an authority connector for your repository (which maybe reads your acl.xml file), as well as a repository connector (which hands documents to LCF and provides it with the access tokens that make security work).  Of course, you can do something much lighter that doesn't include LCF at all if you are just integrating a custom repository of your own, but it sounded like you were interested in the broader problem here.

So, LCF doesn't do "acl mapping" at all.  It relies on its various connectors to work cooperatively to define access tokens in a way that is consistent from authority connector to repository connector for a given repository kind.  Anybody can write a connector, so the beauty of all this is that you can build a system where data from many disparate sources is indexed, and security for each is simultaneously enforced.

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 9:24 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks very much for the diagram -
Sorry about all the questions, but this raises a few new ones...

What is the relationship between stored data (documents) and authorities' access/deny attributes? (do you have any examples of what an access_token value might contain?)

One of the key requirements I've worked to adhere to in SOLR-1872 is to ensure there are no security or other dependencies of indexed data with any external repository - most notably the file system.
There are many reasons for wanting this, but one of the main ones is that Solr-stored data is not always based on file data (or accessible file data). In fact, in my particular case, almost none of the indexed data comes from files.

This is one reason why SOLR-1872 uses filter queries for its access/deny tokens - so that all the required information for access control completely resides within the Solr index itself.
Is the LCF architecture acl 'mapping' between Solr fields (queries) and users, some external 'repository' (files) and users, or arbitrary data (e.g. either of these)?

I hope that makes sense...

Thanks!
Peter




On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:
Hi Peter,

I've attached a diagram that is not in the wiki as of yet, and I'll try to answer your questions.

>>>>>>
Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (does AD need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.
<<<<<<

The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings that represent a contract between an LCF authority connection and the LCF repository connection that picks up the documents (from wherever).  These tokens thus have no real meaning outside of LCF.  You must regard them as opaque.

The contract, however, states that if you use the LCF authority service to obtain tokens for an authenticated user, you will get back a set that is CONSISTENT with the tokens that were attached to the documents LCF sent to Solr for indexing in the first place.  So, you don't have to worry about it, and that's kind of the idea.  So you imagine the following flow:

(1) Use LCF to fetch documents and send them to Solr
(2) When searching, use the LCF authority service to get the desired user's access tokens
(3) Either filter the results, or modify the query, to be sure the access tokens all match up properly
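
For step (3), the query-modification variant could look roughly like the following Python sketch.  The field names are the ones the Solr output connector is described as posting elsewhere in this thread; the helpers `escape_term` and `security_filter`, and the choice of Lucene query-parser syntax for the clause, are assumptions made for illustration, not an LCF or SOLR-1872 API.

```python
def escape_term(value):
    """Quote a token as a phrase, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def security_filter(tokens,
                    allow_field="__ACCESS_TOKEN__document",
                    deny_field="__DENY_TOKEN__document"):
    """Build a clause that requires an ALLOW match and forbids a DENY match.

    `tokens` are the opaque strings returned by the authority service for
    the authenticated user. The field names are assumptions taken from
    this thread.
    """
    if not tokens:
        # Without tokens nothing can be authorized; let the caller decide
        # how to return an empty result instead of guessing a query here.
        raise ValueError("no access tokens for this user")
    terms = " OR ".join(escape_term(t) for t in tokens)
    return "+{a}:({t}) -{d}:({t})".format(a=allow_field, d=deny_field, t=terms)


# Placeholder token values, in the same spirit as the sample response.
clause = security_filter(["xxxxxxx", "yyyyyyy"])
```

The resulting string would be sent as a filter query (`fq`) alongside the user's own query, so that the restriction is applied by the index rather than by post-filtering the result list.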

For the AD authority, the LCF access tokens consist, in part, of the user's SIDs.  For other authorities, the access tokens are wildly different.  You really don't want to know what's in them, since that's the job of the LCF authority to determine. ;-)

LCF is not, by the way, joined at the hip with AD.  However, in practice, most enterprises in the world use some form of AD single sign-on for their web applications, and even if they're using some repository with its own idea of security, there's a mapping between the AD users and the repository's users.  Doing that mapping is also the job of the LCF authority for that repository.

Hope this helps.  Also, I'm not expecting time miracles here, so don't sweat the schedule.


Karl


________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 4:27 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the quick turnaround.
I'm in the middle of a product release for us, so I fear I won't be as quick as you... :-)

I couldn't find a simple flow diagram or similar for LCF with regard to security (probably looking in the wrong place).
Perhaps you could help on these questions...?

In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries, which are then used as filter queries in a user's search.

Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (does AD need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.

I guess I'm just trying to understand the architectural flow/storage/retrieval of data in the various parts of the system, but I admit, I need to do more research on this.
After our product release, when I get a few more spare cycles, I can look at it in more detail.

Many thanks!
Peter



On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com> wrote:
Hi Peter,

I just committed the promised changes to the LCF Solr output connector.

ACL metadata will now be posted to the Solr Http interface along with the document as the two following fields:

__ACCESS_TOKEN__document
__DENY_TOKEN__document

There will, of course, potentially be multiple values for each of these two fields.

Hope this helps,
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 6:51 PM
To: connectors-user@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the SOLR-1872 plugin.
I'll need to digest the LCF stuff to get to grips with it... please bear with me while I do that...

When you say:
  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The url would be something like this (on a locally installed tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
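
A minimal sketch of consuming that response, in Python.  It assumes the reply is plain text with one `TOKEN:` line per token, as in the sample above; the helper names `authority_request_url` and `parse_tokens` are hypothetical, and any other kinds of line the service might return are simply ignored here.

```python
from urllib.parse import urlencode

# URL of a locally installed LCF authority service, as quoted above.
AUTHORITY_URL = "http://localhost:8080/lcf-authority-service/UserACLs"


def authority_request_url(username, base_url=AUTHORITY_URL):
    """Build the UserACLs request URL for one user (username is URL-encoded)."""
    return base_url + "?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract token values from a response made of 'TOKEN:<value>' lines."""
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        # Keep only non-empty token lines; treat the values as opaque strings.
        if line.startswith(prefix) and len(line) > len(prefix):
            tokens.append(line[len(prefix):])
    return tokens


# Sample body mirroring the response shown above.
sample = "TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"
tokens = parse_tokens(sample)
url = authority_request_url("someusername@somedomain.com")
```

Fetching the URL itself (for example with `urllib.request.urlopen`) is left out so that the sketch runs without a live LCF instance.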

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl.xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list whose items come from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whatever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I only see these sentences:

" Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique






---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
>>>>>>
What is the relationship between stored data (documents) and authorities' access/deny attributes? (do you have any examples of what an access_token value might contain?)
<<<<<<

Documents have access/deny attributes; authorities simply provide the list of tokens that belong to an authenticated user.  Thus, there's no access/deny for an authority; that's attached to the document (as it is in real-world repositories).

Let's run a quick example, using Active Directory and a Windows file system.  Suppose that you have a directory with documents in it, call it DirectoryA, and the directory allows read access to the following SIDs:

S-123-456-76890
S-23-64-12345

These SIDs correspond to Active Directory groups, let's call them Group1 and Group2, respectively.

DirectoryB also has documents in it, and those documents have just the SID S-123-456-76890 attached, because only Group1 can read its contents.

Now, pretend that someone has created an LCF Active Directory authority connection (in the LCF UI), which is called "myAD", and this connection is set up to talk to the governing AD domain controller for this Windows file system.  We now know enough to describe the document indexing process:

- Each file in DirectoryA will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890", and "myAD:S-23-64-12345".
- Each file in DirectoryB will have the following __ALLOW_TOKEN__document attributes inside Solr: "myAD:S-123-456-76890"

Now, suppose that a user (let's call him "Peter") is authenticated with the AD domain controller.  Peter belongs to Group2, so his SIDs are (say):

S-1-1-0 (the 'everyone' SID)
S-323-999-12345 (his own personal user SID)
S-23-64-12345 (the SID he gets because he belongs to Group2)

We want to look up the documents in the search index that he can see.  So, we ask the LCF authority service what his tokens are, and we get back:

"myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"

The documents we should return in his search are the ones that match his search criteria AND carry at least one ALLOW token that is also in his token set, MINUS any documents that carry a DENY token that is also in his token set (there aren't any DENY tokens involved in this example).  So only files that have one of his three tokens as an ALLOW attribute would be returned: here, the files in DirectoryA (through "myAD:S-23-64-12345") but not those in DirectoryB.
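
The rule above amounts to two set intersections.  A minimal sketch in Python, using the token values from this example; the function name `is_visible` and the way the tokens are held in plain lists are illustrative assumptions, not part of LCF or Solr.

```python
def is_visible(user_tokens, allow_tokens, deny_tokens=()):
    """Return True if a document should be shown to the user.

    A document is visible when the user holds at least one of its ALLOW
    tokens and none of its DENY tokens. Tokens are opaque strings.
    """
    user = set(user_tokens)
    allowed = bool(user & set(allow_tokens))
    denied = bool(user & set(deny_tokens))
    return allowed and not denied


# Tokens returned by the authority service for Peter.
peter = ["myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345"]

# ALLOW tokens attached to the files of each directory at indexing time.
directory_a = ["myAD:S-123-456-76890", "myAD:S-23-64-12345"]
directory_b = ["myAD:S-123-456-76890"]

# DirectoryA shares the Group2 SID with Peter; DirectoryB shares nothing.
visible_a = is_visible(peter, directory_a)
visible_b = is_visible(peter, directory_b)
```

A DENY token held by the user overrides any ALLOW match, which is why the deny check is combined last.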

Note that what we are attempting to do is enforce AD's security with the search results we present.  There is no need to define a whole new security mechanism, because AD already has one that people use.

>>>>>>
One of the key requirements I've worked to adhere to in SOLR-1872 is to ensure there are no security or other dependencies of indexed data with any external repository - most notably the file system.
There are many reasons for wanting this, but one of the main ones is that Solr-stored data is not always based on file data (or accessible file data). In fact, in my particular case, almost none of the indexed data comes from files.
<<<<<<

LCF is all about abstracting from repositories.  It's not specifically about a file system, although that is a convenient example.  If you are building your own kind of repository with your own security setup, that's fine - but in the LCF world you'd need to create an authority connector for your repository (which maybe reads your acl.xml file), as well as a repository connector (which hands documents to LCF and provides it with the access tokens that make security work).  Of course, you can do something much lighter that doesn't include LCF at all if you are just integrating a custom repository of your own, but it sounded like you were interested in the broader problem here.

So, LCF doesn't do "acl mapping" at all.  It relies on its various connectors to work cooperatively to define access tokens in a way that is consistent from authority connector to repository connector for a given repository kind.  Anybody can write a connector, so the beauty of all this is that you can build a system where data from many disparate sources is indexed, and security for each is simultaneously enforced.

Karl




Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Thanks very much for the diagram -
Sorry about all the questions, but this raises a few new ones...

What is the relationship between stored data (documents) and authorities'
access/deny attributes? (do you have any examples of what an access_token
value might contain?)
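
As a concrete illustration of the format: Karl's earlier reply, quoted below, shows the authority service returning plain lines of the form TOKEN:xxxxxxx. A minimal Python sketch of reading such a response might look like the following. The function name parse_authority_response and the sample values are made up for illustration; only the TOKEN: prefix comes from the thread, and real token values should be treated as opaque.

```python
from typing import List


def parse_authority_response(body: str) -> List[str]:
    """Collect the values of all 'TOKEN:<value>' lines in a response body.

    Hypothetical helper: lines that do not start with 'TOKEN:' are skipped,
    and the token values are returned unchanged (they are opaque strings).
    """
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


# Sample body modelled on the placeholder response quoted in this thread.
sample_body = "TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"
user_tokens = parse_authority_response(sample_body)
```

Whatever ends up in user_tokens is only meaningful when compared against the token fields that LCF attached to the indexed documents.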

One of the key requirements I've worked to adhere to in SOLR-1872 is to
ensure there are no security or other dependencies of indexed data with any
external repository - most notably the file system.
There are many reasons for wanting this, but one of the main ones is that
Solr-stored data is not always based on file data (or accessible file data).
In fact, in my particular case, almost none of the indexed data comes from
files.

This is one reason why SOLR-1872 uses filter queries for its access/deny
tokens - so that all the required information for access control completely
resides within the Solr index itself.
Does the LCF architecture map ACLs between Solr fields (queries) and users,
between some external 'repository' (files) and users, or between arbitrary
data (e.g. either of these) and users?
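
To make the filter-query idea concrete, here is a rough sketch of how a set of opaque tokens could be turned into a Solr filter query against the two fields the LCF Solr output connector posts (__ACCESS_TOKEN__document and __DENY_TOKEN__document, per Karl's message quoted below). The function names are hypothetical, and the rule assumed here (at least one allow token must match and no deny token may match) is an assumption about the intended semantics, not something confirmed in this thread.

```python
from typing import List


def quote_token(token: str) -> str:
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_acl_filter(
    tokens: List[str],
    allow_field: str = "__ACCESS_TOKEN__document",
    deny_field: str = "__DENY_TOKEN__document",
) -> str:
    """Build a filter-query string from a user's tokens.

    Assumed rule: a document is visible if one of the user's tokens appears
    in the allow field and none of them appears in the deny field.
    """
    if not tokens:
        # Without tokens nothing can match the allow clause, so the caller
        # should treat this user as having no access at all.
        raise ValueError("no access tokens for this user")
    terms = " OR ".join(quote_token(t) for t in tokens)
    return "+{0}:({2}) -{1}:({2})".format(allow_field, deny_field, terms)


acl_filter = build_acl_filter(["xxxxxxx", "yyyyyyy"])
```

The resulting string would be sent as an fq parameter alongside the user's own query, so the restriction is applied inside the index rather than by post-filtering the result list.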

I hope that makes sense...

Thanks!
Peter




On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> I've attached a diagram that is not in the wiki as of yet, and I'll try to
> answer your questions.
>
> >>>>>>
> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
> particular user in the underlying acl store (e.g. Active Directory)?
> How does AD and/or LCF handle storing such data in its schema? (does AD
> need its schema extended?)
> Presumably, any such AD fields would need to be queried for effective
> rights in order to cater for group membership allows and denies.
> <<<<<<
>
> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings
> that represent a contract between an LCF authority connection and the LCF
> repository connection that picks up the documents (from wherever).  These
> tokens thus have no real meaning outside of LCF.  You must regard them as
> opaque.
>
> The contract, however, states that if you use the LCF authority service to
> obtain tokens for an authenticated user, you will get back a set that is
> CONSISTENT with the tokens that were attached to the documents LCF sent to
> Solr for indexing in the first place.  So, you don't have to worry about it,
> and that's kind of the idea.  So you imagine the following flow:
>
> (1) Use LCF to fetch documents and send them to Solr
> (2) When searching, use the LCF authority service to get the desired user's
> access tokens
> (3) Either filter the results, or modify the query, to be sure the access
> tokens all match up properly
>
> For the AD authority, the LCF access tokens consist, in part, of the user's
> SIDs.  For other authorities, the access tokens are wildly different.  You
> really don't want to know what's in them, since that's the job of the LCF
> authority to determine. ;-)
>
> LCF is not, by the way, joined at the hip with AD.  However, in practice,
> most enterprises in the world use some form of AD single sign-on for their
> web applications, and even if they're using some repository with its own
> idea of security, there's a mapping between the AD users and the
> repository's users.  Doing that mapping is also the job of the LCF authority
> for that repository.
>
> Hope this helps.  Also, I'm not expecting time miracles here, so don't
> sweat the schedule.
>
>
> Karl
>
>
> ________________________________________
> From: ext Peter Sturge [peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 4:27 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks for the quick turnaround.
> I'm in the middle of a product release for us, so I fear I won't be as
> quick as you... :-)
>
> I couldn't find a simple flow diagram or similar for LCF with regard to
> security (probably looking in the wrong place).
> Perhaps you could help on these questions...?
>
> In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries,
> which are then used as filter queries in a user's search.
>
> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
> particular user in the underlying acl store (e.g. Active Directory)?
> How does AD and/or LCF handle storing such data in its schema? (does AD
> need its schema extended?)
> Presumably, any such AD fields would need to be queried for effective
> rights in order to cater for group membership allows and denies.
>
> I guess I'm just trying to understand the architectural
> flow/storage/retrieval of data in the various parts of the system, but I
> admit, I need to do more research on this.
> After our product release, when I get a few more spare cycles, I can look
> at it in more detail.
>
> Many thanks!
> Peter
>
>
>
> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> Hi Peter,
>
> I just committed the promised changes to the LCF Solr output connector.
>
> ACL metadata will now be posted to the Solr Http interface along with the
> document as the two following fields:
>
> __ACCESS_TOKEN__document
> __DENY_TOKEN__document
>
> There will, of course, potentially be multiple values for each of these two
> fields.
>
> Hope this helps,
> Karl
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Tuesday, April 20, 2010 6:51 PM
> To: connectors-user@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks for the info. I'll have a look at the link and try to take in as
> much sugar as my insulin levels will handle...
> It sounds like the necessary interface(s) are already in LCF - just a
> matter of implementing them in the SOLR-1872 plugin.
> I'll need to digest the LCF stuff to get to grips with it... please bear with
> me while I do that...
>
> When you say:
>   The LCF solr output connection doesn't yet do this, but it is trivial for
> me to make that happen.
> Do you mean a mechanism by which solr.war can get url et al info from its
> parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>
>
> Thanks,
> Peter
>
>
>
>
> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> Hi Peter,
>
> I'm the principal committer for LCF, but I don't know as much about Solr as
> I ought to, so it sounds like a potentially productive collaboration.
>
> LCF does exactly what you are looking for - the only issue at all is that
> you need to fetch a URL from a webapp to get what you are looking for.  The
> "plugs" are all inside LCF for different kinds of repositories.  Here's a
> link that might help with drinking the LCF "koolaid", as it were:
> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>
> The URL would be something like this (on a locally installed Tomcat-based
> LCF instance):
>
>
> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>
> ... and this fetch returns something like:
>
> TOKEN:xxxxxxx
> TOKEN:yyyyyyy
> TOKEN:zzzzzzz
> ....
>
> ... which represent the amalgamated tokens for all of the defined
> authorities, and by some strange coincidence ( ;-) ) are compatible with
> certain pieces of metadata that have been passed into Solr with each
> document - one set of Allow tokens, and a second set of Deny tokens.  The
> LCF solr output connection doesn't yet do this, but it is trivial for me to
> make that happen.
>
> Does this sound plausible to you?
>
> Karl
>
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Tuesday, April 20, 2010 5:41 PM
> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Integrating LCF to get external token support for SOLR-1872 sounds very
> interesting indeed. I don't know anything about LCF, but one of the things I
> was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
> 'pluggable' - i.e. it would just be one of a series of plugins that could be
> used for obtaining back-end authentication information.
>
> If you're good with LCF, perhaps we could work together to build this in.
> One of the first things would be defining an interface that would be as easy
> as possible to plug LCF into. Have you any suggestions/insight on this
> front?
>
> Many thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> SOLR-1872 looks exactly like what I was envisioning, from the search query
> perspective, although instead of the acl.xml file you specify, LCF stipulates
> that you dynamically query the lcf-authority-service servlet for the access
> tokens themselves.  That would get you support for AD, Documentum, LiveLink,
> Meridio, and Memex for free.  It seems likely that this component could be
> modified to work with LCF with minor effort.
>
> The missing component still seems to be AD authentication, which needs a
> solution.
>
> Karl
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Tuesday, April 20, 2010 10:44 AM
> To: dev@lucene.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> If you want to do this completely within Solr, have a look at:
> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>
> Thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> FYI
>
> ________________________________
> From: Wright Karl (Nokia-S/Cambridge)
> Sent: Tuesday, April 20, 2010 8:16 AM
> To: 'dominique.bejean@eolya.fr'
> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
> Subject: RE: Solr and LCF security at query time
>
> Dominique,
>
> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a
> powerful multi-repository security model, even though it doesn't yet do the
> final step of enforcing that model at the search end.  LCF allows you to
> define multiple authorities to operate against disparate repositories, and
> use the appropriate authority to secure any given document.  The Solr people
> are aware of this design, which addresses the issues raised by SOLR-1834
> very nicely.  However, as I said before, time is a problem, and the work
> still needs to be done.
>
> I suggest you read up on the actual security model of LCF, and perhaps
> experiment with that and the SOLR-1834 contribution, to see if there is
> common ground.  One thing we've learned at MetaCarta is that post-filtering
> for security purposes is expensive, and it is better to modify the queries
> themselves to restrict the results, if possible.  I'm not sure which
> approach SOLR-1834 takes, although it sounds like it might be the filtering
> approach.  Still, it would be better than nothing.
>
> Please let me know what you find out.
>
> Thanks,
> Karl
>
> ________________________________
> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> Sent: Tuesday, April 20, 2010 8:03 AM
> To: Wright Karl (Nokia-S/Cambridge)
> Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
> Subject: Re: Solr and LCF security at query time
>
> Karl,
>
> Thank you for your reply.
>
> I did some research today and found this:
> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> http://demo.findwise.se:8880/SolrSecurity/
>
> The Solr security model has to be able to filter result lists containing
> items that come from various sources at the same time (LiveLink, Documentum,
> file system, ...). Big subject :)
>
> Dominique
>
>
> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> Hi Dominique,
>
> At the moment, in order to enforce the LCF security model within
> Lucene/Solr, you will need to build this functionality into whatever client
> you are using to display the Lucene search results.  Specifically, you would
> need to take the following steps:
>
> (1) Have your users access your search client through Apache.
> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
> the client webapp.
> (3) Have your client webapp alter whatever queries it is doing, to add an
> appropriate query clause for each of the access tokens transmitted in the
> headers.
>
> (This is how it is done at MetaCarta.)
>
> Alternatively, you may find a way to do this completely within a web
> application under a Java app server such as Tomcat.  I have not yet
> researched whether this is feasible.  Effectively, what you need something
> like mod_auth_kerb to do is authenticate your user against Active Directory,
> or whatever the authenticator ought to be.  JAAS may be helpful here.
>
> There are, of course, intentions to fill out the missing pieces more
> completely and transparently via a Solr search plugin and/or filter.  What
> has been lacking is time.  If you are in a position to do development in
> this area, we're happy to have any assistance you might provide.
>
> Thanks,
> Karl
> ________________________________
> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> Sent: Tuesday, April 20, 2010 5:06 AM
> To: connectors-user@incubator.apache.org
> Subject: Solr and LCF security at query time
>
> Hi,
>
> I don't see in the LCF wiki how Solr and LCF work together at query time in
> order to remove from the result list the items the user is not allowed to
> access.
>
> In
> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> I only see these sentences:
>
> " Once all these documents and their access tokens are handed to the search
> engine, it is the search engine's job to enforce security by excluding
> inappropriate documents from the search results. For Lucene, this
> infrastructure is expected to be built on top of Lucene's generic metadata
> abilities, but has not been implemented at this time."
>
> I am not sure I understand. Does this mean that, for the moment, it is not
> possible for Solr to apply security by using an Authority Connector?
>
> Dominique
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Thanks very much for the diagram -
Sorry about all the questions, but this raises a few new ones...

What is the relationship between stored data (documents) and authorities'
access/deny attributes? (do you have any examples of what an access_token
value might contain?)

One of the key requirements I've worked to adhere to in SOLR-1872 is to
ensure there are no security or other dependencies of indexed data with any
external repository - most notably the file system.
There are many reasons for wanting this, but one of the main ones is that
Solr-stored data is not always based on file data (or accessible file data).
In fact, in my particular case, almost none of the indexed data comes from
files.

This is one reason why SOLR-1872 uses filter queries for its access/deny
tokens - so that all the required information for access control completely
resides within the Solr index itself.
Does the LCF architecture map ACLs between Solr fields (queries) and users,
between some external 'repository' (files) and users, or between arbitrary
data (e.g. either of these) and users?

I hope that makes sense...

Thanks!
Peter




On Thu, Apr 22, 2010 at 10:25 AM, <ka...@nokia.com> wrote:

> Hi Peter,
>
> I've attached a diagram that is not in the wiki as of yet, and I'll try to
> answer your questions.
>
> >>>>>>
> Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
> particular user in the underlying acl store (e.g. Active Directory)?
> How does AD and/or LCF handle storing such data in its schema? (does AD
> need its schema extended?)
> Presumably, any such AD fields would need to be queried for effective
> rights in order to cater for group membership allows and denies.
> <<<<<<
>
> The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings
> that represent a contract between an LCF authority connection and the LCF
> repository connection that picks up the documents (from wherever).  These
> tokens thus have no real meaning outside of LCF.  You must regard them as
> opaque.
>
> The contract, however, states that if you use the LCF authority service to
> obtain tokens for an authenticated user, you will get back a set that is
> CONSISTENT with the tokens that were attached to the documents LCF sent to
> Solr for indexing in the first place.  So, you don't have to worry about it,
> and that's kind of the idea.  So you imagine the following flow:
>
> (1) Use LCF to fetch documents and send them to Solr
> (2) When searching, use the LCF authority service to get the desired user's
> access tokens
> (3) Either filter the results, or modify the query, to be sure the access
> tokens all match up properly
>
> For the AD authority, the LCF access tokens consist, in part, of the user's
> SIDs.  For other authorities, the access tokens are wildly different.  You
> really don't want to know what's in them, since that's the job of the LCF
> authority to determine. ;-)
>
> LCF is not, by the way, joined at the hip with AD.  However, in practice,
> most enterprises in the world use some form of AD single sign-on for their
> web applications, and even if they're using some repository with its own
> idea of security, there's a mapping between the AD users and the
> repository's users.  Doing that mapping is also the job of the LCF authority
> for that repository.
>
> Hope this helps.  Also, I'm not expecting time miracles here, so don't
> sweat the schedule.
>
>
> Karl
>
>
> ________________________________________
> From: ext Peter Sturge [peter.sturge@googlemail.com]
> Sent: Thursday, April 22, 2010 4:27 AM
> To: dev@lucene.apache.org
> Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org;
> connectors-dev@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks for the quick turnaround.
> I'm in the middle of a product release for us, so I fear I won't be as
> quick as you... :-)
>
> I couldn't find a simple flow diagram or similar for LCF with regard to
> security (probably looking in the wrong place).
> Perhaps you could help on these questions...?
>
> In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries,
> which are then used as filter queries in a user's search.
>
> Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored for a
> particular user in the underlying ACL store (e.g. Active Directory)?
> How do AD and/or LCF handle storing such data in their schemas? (Does AD
> need its schema extended?)
> Presumably, any such AD fields would need to be queried for effective
> rights in order to cater for group membership allows and denies.
>
> I guess I'm just trying to understand the architectural
> flow/storage/retrieval of data in the various parts of the system, but I
> admit, I need to do more research on this.
> After our product release, when I get a few more spare cycles, I can look
> at it in more detail.
>
> Many thanks!
> Peter
>
>
>
> On Thu, Apr 22, 2010 at 1:02 AM, <karl.wright@nokia.com> wrote:
> Hi Peter,
>
> I just committed the promised changes to the LCF Solr output connector.
>
> ACL metadata will now be posted to the Solr HTTP interface along with the
> document as the following two fields:
>
> __ACCESS_TOKEN__document
> __DENY_TOKEN__document
>
> There will, of course, potentially be multiple values for each of these two
> fields.
>
> Hope this helps,
> Karl
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Tuesday, April 20, 2010 6:51 PM
> To: connectors-user@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks for the info. I'll have a look at the link and try to take in as
> much sugar as my insulin levels will handle...
> It sounds like the necessary interface(s) are already in LCF - just a
> matter of implementing them in the SOLR-1872 plugin.
> I'll need to digest the LCF stuff to get to grips with it... please bear
> with me while I do that...
>
> When you say:
>   The LCF solr output connection doesn't yet do this, but it is trivial for
> me to make that happen.
> Do you mean a mechanism by which solr.war can get URL and related info from
> its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>
>
> Thanks,
> Peter
>
>
>
>
> On Tue, Apr 20, 2010 at 11:05 PM, <karl.wright@nokia.com> wrote:
> Hi Peter,
>
> I'm the principal committer for LCF, but I don't know as much about Solr as
> I ought to, so it sounds like a potentially productive collaboration.
>
> LCF does exactly what you are looking for - the only issue at all is that
> you need to fetch a URL from a webapp to get what you are looking for.  The
> "plugs" are all inside LCF for different kinds of repositories.  Here's a
> link that might help with drinking the LCF "koolaid", as it were:
> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>
> The URL would be something like this (on a locally installed Tomcat-based
> LCF instance):
>
>
> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>
> ... and this fetch returns something like:
>
> TOKEN:xxxxxxx
> TOKEN:yyyyyyy
> TOKEN:zzzzzzz
> ....
>
> ... which represent the amalgamated tokens for all of the defined
> authorities, and by some strange coincidence ( ;-) ) are compatible with
> certain pieces of metadata that have been passed into Solr with each
> document - one set of Allow tokens, and a second set of Deny tokens.  The
> LCF solr output connection doesn't yet do this, but it is trivial for me to
> make that happen.
>
> Does this sound plausible to you?
>
> Karl
>
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Tuesday, April 20, 2010 5:41 PM
> To: connectors-user@incubator.apache.org; dev@lucene.apache.org
>
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Integrating LCF to get external token support for SOLR-1872 sounds very
> interesting indeed. I don't know anything about LCF, but one of the things I
> was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
> 'pluggable' - i.e. it would just be one of a series of plugins that could be
> used for obtaining back-end authentication information.
>
> If you're good with LCF, perhaps we could work together to build this in.
> One of the first things would be defining an interface that would be as easy
> as possible to plug LCF into. Have you any suggestions/insight on this
> front?
>
> Many thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
> SOLR-1872 looks exactly like what I was envisioning, from the search query
> perspective, although instead of the acl.xml file you specify, LCF
> stipulates that you would dynamically query the lcf-authority-service
> servlet for the access tokens themselves.  That would get you support for
> AD, Documentum, LiveLink,
> Meridio, and Memex for free. It seems likely that this component could be
> modified to work with LCF with minor effort.
>
> The missing component still seems to be AD authentication, which needs a
> solution.
>
> Karl
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Tuesday, April 20, 2010 10:44 AM
> To: dev@lucene.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> If you want to do this completely within Solr, have a look at:
> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>
> Thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
> FYI
>
> ________________________________
> From: Wright Karl (Nokia-S/Cambridge)
> Sent: Tuesday, April 20, 2010 8:16 AM
> To: 'dominique.bejean@eolya.fr'
> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org';
> 'connectors-user@incubator.apache.org'
> Subject: RE: Solr and LCF security at query time
>
> Dominique,
>
> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a
> powerful multi-repository security model, even though it doesn't yet do the
> final step of enforcing that model at the search end.  LCF allows you to
> define multiple authorities to operate against disparate repositories, and
> use the appropriate authority to secure any given document.  The Solr people
> are aware of this design, which addresses the issues raised by SOLR-1834
> very nicely.  However, as I said before, time is a problem, and the work
> still needs to be done.
>
> I suggest you read up on the actual security model of LCF, and perhaps
> experiment with that and the SOLR-1834 contribution, to see if there is
> common ground.  One thing we've learned at MetaCarta is that post-filtering
> for security purposes is expensive, and it is better to modify the queries
> themselves to restrict the results, if possible.  I'm not sure which
> approach SOLR-1834 takes, although it sounds like it might be the filtering
> approach.  Still, it would be better than nothing.
>
> Please let me know what you find out.
>
> Thanks,
> Karl
>
> ________________________________
> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> Sent: Tuesday, April 20, 2010 8:03 AM
> To: Wright Karl (Nokia-S/Cambridge)
> Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
> Subject: Re: Solr and LCF security at query time
>
> Karl,
>
> Thank you for your reply.
>
> I did some research today and found this:
> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> http://demo.findwise.se:8880/SolrSecurity/
>
> The Solr security model has to be able to filter a result list with items
> coming from various sources at the same time (LiveLink, Documentum, file
> system, ...). Big subject :)
>
> Dominique
>
>
> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> Hi Dominique,
>
> At the moment, in order to enforce the LCF security model within
> Lucene/Solr, you will need to build this functionality into whatever client
> you are using to display the Lucene search results.  Specifically, you would
> need to take the following steps:
>
> (1) Have your users access your search client through Apache.
> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
> the client webapp.
> (3) Have your client webapp alter whatever queries it is doing, to add an
> appropriate query clause for each of the access tokens transmitted in the
> headers.
>
> (This is how it is done at MetaCarta.)
>
> Alternatively, you may find a way to do this completely with a web
> application under a Java app server such as Tomcat.  I have not yet done the
> research to find out whether this is a feasible alternative.  Effectively,
> what you need something like mod_auth_kerb to do is to authenticate your
> user against Active Directory, or whatever the authenticator ought to be.
>  JAAS may be helpful here.
>
> There are, of course, intentions to fill out the missing pieces more
> completely and transparently via a Solr search plugin and/or filter.  What
> has been lacking is time.  If you are in a position to do development in
> this area, we're happy to have any assistance you might provide.
>
> Thanks,
> Karl
> ________________________________
> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> Sent: Tuesday, April 20, 2010 5:06 AM
> To: connectors-user@incubator.apache.org
> Subject: Solr and LCF security at query time
>
> Hi,
>
> I don't see in the LCF wiki how Solr and LCF work together at query time in
> order to remove from the result list the items the user is not allowed to
> access.
>
> In
> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> I see only these sentences:
>
> " Once all these documents and their access tokens are handed to the search
> engine, it is the search engine's job to enforce security by excluding
> inappropriate documents from the search results. For Lucene, this
> infrastructure is expected to be built on top of Lucene's generic metadata
> abilities, but has not been implemented at this time."
>
> I am not sure I understand. Does this mean that, for the moment, it is not
> possible for Solr to apply security by using an Authority Connector?
>
> Dominique
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

I've attached a diagram that is not in the wiki as of yet, and I'll try to answer your questions.

>>>>>>
Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored for a particular user in the underlying ACL store (e.g. Active Directory)?
How do AD and/or LCF handle storing such data in their schemas? (Does AD need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.
<<<<<<

The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings that represent a contract between an LCF authority connection and the LCF repository connection that picks up the documents (from wherever).  These tokens thus have no real meaning outside of LCF.  You must regard them as opaque.

The contract, however, states that if you use the LCF authority service to obtain tokens for an authenticated user, you will get back a set that is CONSISTENT with the tokens that were attached to the documents LCF sent to Solr for indexing in the first place.  So, you don't have to worry about it, and that's kind of the idea.  So you imagine the following flow:

(1) Use LCF to fetch documents and send them to Solr
(2) When searching, use the LCF authority service to get the desired user's access tokens
(3) Either filter the results, or modify the query, to be sure the access tokens all match up properly
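
The "modify the query" form of step (3) could be sketched roughly as below. This is only an illustration, not LCF or SOLR-1872 code: the two field names are the ones the LCF Solr output connector posts (mentioned elsewhere in this thread), while the function names, the quoting scheme, and the rule "at least one matching access token and no matching deny token" are assumptions made for the example.

```python
# Field names as posted by the LCF Solr output connector (from this thread).
ACCESS_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def quote_token(token):
    """Wrap an opaque token in double quotes, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_security_filters(tokens):
    """Return filter-query strings (hypothetical helper) for a user's tokens.

    Assumed rule: a document is visible if it carries at least one of the
    user's tokens as an access token and none of them as a deny token.
    """
    if not tokens:
        # What an empty token set should mean is a policy decision, so the
        # sketch refuses to guess instead of silently matching everything.
        raise ValueError("no access tokens for this user")
    clause = " OR ".join(quote_token(t) for t in tokens)
    return [
        f"{ACCESS_FIELD}:({clause})",
        f"*:* -{DENY_FIELD}:({clause})",
    ]
```

Each returned string would be sent as a separate fq parameter alongside the user's own query, so the user's query text itself is left untouched.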

For the AD authority, the LCF access tokens consist, in part, of the user's SIDs.  For other authorities, the access tokens are wildly different.  You really don't want to know what's in them, since that's the job of the LCF authority to determine. ;-)

LCF is not, by the way, joined at the hip with AD.  However, in practice, most enterprises in the world use some form of AD single sign-on for their web applications, and even if they're using some repository with its own idea of security, there's a mapping between the AD users and the repository's users.  Doing that mapping is also the job of the LCF authority for that repository.

Hope this helps.  Also, I'm not expecting time miracles here, so don't sweat the schedule.


Karl


________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 4:27 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the quick turnaround.
I'm in the middle of a product release for us, so I fear I won't be as quick as you... :-)

I couldn't find a simple flow diagram or similar for LCF with regard to security (probably looking in the wrong place).
Perhaps you could help on these questions...?

In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries, which are then used as filter queries in a user's search.

Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored for a particular user in the underlying ACL store (e.g. Active Directory)?
How do AD and/or LCF handle storing such data in their schemas? (Does AD need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.

I guess I'm just trying to understand the architectural flow/storage/retrieval of data in the various parts of the system, but I admit, I need to do more research on this.
After our product release, when I get a few more spare cycles, I can look at it in more detail.

Many thanks!
Peter



On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com> wrote:
Hi Peter,

I just committed the promised changes to the LCF Solr output connector.

ACL metadata will now be posted to the Solr HTTP interface along with the document as the following two fields:

__ACCESS_TOKEN__document
__DENY_TOKEN__document

There will, of course, potentially be multiple values for each of these two fields.
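
To make the multi-valued fields concrete, here is a small sketch of the other enforcement option, filtering results after the search. The sample document, the `is_visible` helper, and the visibility rule (at least one matching access token, no matching deny token) are assumptions for illustration only; the token values are treated as opaque strings.

```python
ACCESS_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def is_visible(doc, user_tokens):
    """Return True if the user's tokens permit seeing this result document.

    Assumed rule: at least one of the user's tokens appears among the
    document's access tokens, and none appears among its deny tokens.
    """
    tokens = set(user_tokens)
    allowed = set(doc.get(ACCESS_FIELD, []))
    denied = set(doc.get(DENY_FIELD, []))
    return bool(allowed & tokens) and not (denied & tokens)


# A result document with several values per field (made-up tokens).
sample_doc = {
    "id": "doc-1",
    ACCESS_FIELD: ["token-a", "token-b"],
    DENY_FIELD: ["token-x"],
}

visible_to_a = is_visible(sample_doc, ["token-a"])
visible_to_a_and_x = is_visible(sample_doc, ["token-a", "token-x"])
```

As noted elsewhere in the thread, post-filtering like this tends to be more expensive than restricting the query itself, so it is shown only to illustrate how the two fields relate to a user's tokens.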

Hope this helps,
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 6:51 PM
To: connectors-user@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the SOLR-1872 plugin.
I'll need to digest the LCF stuff to get to grips with it... please bear with me while I do that...

When you say:
   The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
Do you mean a mechanism by which solr.war can get URL and related info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The URL would be something like this (on a locally installed Tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
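
A client for this fetch might look something like the sketch below. The URL shape and the "TOKEN:" line prefix come from the example above; everything else (the function names, ignoring lines without the prefix, UTF-8 decoding, the timeout) is an assumption, and the real service may return other kinds of lines that a proper client should handle.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def authority_url(base_url, username):
    """Build the authority-service URL for a user (username is URL-encoded)."""
    return base_url + "?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract token values from a response with one 'TOKEN:...' per line.

    Lines without the TOKEN: prefix are skipped (an assumption of this sketch).
    """
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


def fetch_user_tokens(base_url, username, timeout=10):
    """Fetch and parse the tokens for a user; assumes a UTF-8 text response."""
    with urlopen(authority_url(base_url, username), timeout=timeout) as resp:
        return parse_tokens(resp.read().decode("utf-8"))
```

For the local instance above, the call would be fetch_user_tokens("http://localhost:8080/lcf-authority-service/UserACLs", "someusername@somedomain.com"), and the resulting list is what gets matched against the tokens stored with each document.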

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org

Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl.xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The Solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list with items coming from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whatever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I see only these sentences:

" Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique





RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

I've attached a diagram that is not in the wiki as of yet, and I'll try to answer your questions.

>>>>>>
Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (does AD needs its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.
<<<<<<

The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings that represent a contract between an LCF authority connection and the LCF repository connection that picks up the documents (from wherever).  These tokens thus have no real meaning outside of LCF.  You must regard them as opaque.

The contract, however, states that if you use the LCF authority service to obtain tokens for an authenticated user, you will get back a set that is CONSISTENT with the tokens that were attached to the documents LCF sent to Solr for indexing in the first place.  So, you don't have to worry about it, and that's kind of the idea.  So you imagine the following flow:

(1) Use LCF to fetch documents and send them to Solr
(2) When searching, use the LCF authority service to get the desired user's access tokens
(3) Either filter the results, or modify the query, to be sure the access tokens all match up properly

For the AD authority, the LCF access tokens consist, in part, of the user's SIDs.  For other authorities, the access tokens are wildly different.  You really don't want to know what's in them, since that's the job of the LCF authority to determine. ;-)

LCF is not, by the way, joined at the hip with AD.  However, in practice, most enterprises in the world use some form of AD single signon for their web applications, and even if they're using some repository with its own idea of security, there's a mapping between the AD users and the repository's users.  Doing that mapping is also the job of the LCF authority for that repository.

Hope this helps.  Also, I'm not expecting time miracles here, so don't sweat the schedule.


Karl


________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 4:27 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the quick turnaround.
I'm in the middle of a product release for us, so I fear I won't be as quick as you... :-)

I couldn't find a simple flow diagram or similar for LCF with regards security (probably looking in the wrong place).
Perhaps you could help on these questions...?

In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries, which are then used as filter queries in a user's search.

Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (does AD needs its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.

I guess I'm just trying to understand the architectural flow/storage/retrieval of data in the various parts of the system, but I admit, I need to do more research on this.
After our product release, when I get a few more spare cycles, I can look at it in more detail.

Many thanks!
Peter



On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com>> wrote:
Hi Peter,

I just committed the promised changes to the LCF Solr output connector.

ACL metadata will now be posted to the Solr Http interface along with the document as the two following fields:

__ACCESS_TOKEN__document
__DENY_TOKEN__document

There will, of course, potentially be multiple values for each of these two fields.

Hope this helps,
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
Sent: Tuesday, April 20, 2010 6:51 PM

To: connectors-user@incubator.apache.org<ma...@incubator.apache.org>
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the Solr 1872 plugin.
I'll need to digest the LCF stuff to get to grips with it..please bear with me while I do that...

When you say:
   The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com>> wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The url would be something like this (on a locally installed tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
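
As a rough illustration of the fetch described above, here is a minimal Python sketch that builds the authority-service URL and parses a response body of the "TOKEN:..." form shown. The function names (authority_url, parse_tokens) are hypothetical, and skipping any line that does not start with "TOKEN:" is an assumption, not documented LCF behaviour.

```python
from urllib.parse import urlencode

TOKEN_PREFIX = "TOKEN:"


def authority_url(base_url, username):
    """Build the UserACLs URL for a user (hypothetical helper)."""
    # urlencode percent-escapes characters such as '@' in the user name.
    return f"{base_url}/UserACLs?{urlencode({'username': username})}"


def parse_tokens(body):
    """Extract token values from a response made of 'TOKEN:<value>' lines.

    Assumption: lines without the TOKEN: prefix are not access tokens
    and are skipped.
    """
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(TOKEN_PREFIX):
            tokens.append(line[len(TOKEN_PREFIX):])
    return tokens
```

The body passed to parse_tokens would come from an HTTP GET on the URL returned by authority_url; the fetch itself is left out so the sketch does not depend on a running LCF instance.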

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org

Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl.xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list whose items come from various sources at the same time (livelink, documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whomever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences :

" Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique





RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

I've attached a diagram that is not in the wiki as of yet, and I'll try to answer your questions.

>>>>>>
Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored for a particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (Does AD need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.
<<<<<<

The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings that represent a contract between an LCF authority connection and the LCF repository connection that picks up the documents (from wherever).  These tokens thus have no real meaning outside of LCF.  You must regard them as opaque.

The contract, however, states that if you use the LCF authority service to obtain tokens for an authenticated user, you will get back a set that is CONSISTENT with the tokens that were attached to the documents LCF sent to Solr for indexing in the first place.  You don't have to worry about what the tokens contain, and that's kind of the idea.  So imagine the following flow:

(1) Use LCF to fetch documents and send them to Solr
(2) When searching, use the LCF authority service to get the desired user's access tokens
(3) Either filter the results, or modify the query, to be sure the access tokens all match up properly
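
To make step (3) concrete, here is a minimal Python sketch of the query-modification variant. It assumes the two field names from the Solr output connector (__ACCESS_TOKEN__document and __DENY_TOKEN__document) and assumes, as one plausible reading of the allow/deny model, that a document is visible when it carries at least one of the user's tokens as an allow token and none of them as a deny token. The helper names are hypothetical, and handling of documents that carry no tokens at all is not covered.

```python
ALLOW_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def quote(token):
    """Wrap an opaque token in double quotes, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def security_filter(tokens):
    """Build a filter-query string from the user's access tokens.

    Assumed semantics: require a match on some allow token and
    exclude any document that lists one of the tokens as a deny token.
    """
    if not tokens:
        # With no tokens there is nothing to match; the caller decides
        # how to handle an unknown or unauthorized user.
        raise ValueError("user has no access tokens")
    values = " OR ".join(quote(t) for t in tokens)
    return f"+{ALLOW_FIELD}:({values}) -{DENY_FIELD}:({values})"
```

The resulting string could be passed as an additional filter query alongside the user's own query, so that the restriction is applied by the search engine rather than by post-filtering the results.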

For the AD authority, the LCF access tokens consist, in part, of the user's SIDs.  For other authorities, the access tokens are wildly different.  You really don't want to know what's in them, since that's the job of the LCF authority to determine. ;-)

LCF is not, by the way, joined at the hip with AD.  However, in practice, most enterprises in the world use some form of AD single signon for their web applications, and even if they're using some repository with its own idea of security, there's a mapping between the AD users and the repository's users.  Doing that mapping is also the job of the LCF authority for that repository.

Hope this helps.  Also, I'm not expecting time miracles here, so don't sweat the schedule.


Karl


________________________________________
From: ext Peter Sturge [peter.sturge@googlemail.com]
Sent: Thursday, April 22, 2010 4:27 AM
To: dev@lucene.apache.org
Cc: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the quick turnaround.
I'm in the middle of a product release for us, so I fear I won't be as quick as you... :-)

I couldn't find a simple flow diagram or similar for LCF with regards security (probably looking in the wrong place).
Perhaps you could help on these questions...?

In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries, which are then used as filter queries in a user's search.

Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored for a particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (Does AD need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in order to cater for group membership allows and denies.

I guess I'm just trying to understand the architectural flow/storage/retrieval of data in the various parts of the system, but I admit, I need to do more research on this.
After our product release, when I get a few more spare cycles, I can look at it in more detail.

Many thanks!
Peter



On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com> wrote:
Hi Peter,

I just committed the promised changes to the LCF Solr output connector.

ACL metadata will now be posted to the Solr Http interface along with the document as the two following fields:

__ACCESS_TOKEN__document
__DENY_TOKEN__document

There will, of course, potentially be multiple values for each of these two fields.

Hope this helps,
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 6:51 PM

To: connectors-user@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the Solr 1872 plugin.
I'll need to digest the LCF stuff to get to grips with it..please bear with me while I do that...

When you say:
   The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The url would be something like this (on a locally installed tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org

Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl.xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list whose items come from various sources at the same time (livelink, documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whomever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences :

" Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique





Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Thanks for the quick turnaround.
I'm in the middle of a product release for us, so I fear I won't be as quick
as you... :-)

I couldn't find a simple flow diagram or similar for LCF with regards
security (probably looking in the wrong place).
Perhaps you could help on these questions...?

In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries,
which are then used as filter queries in a user's search.

Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored for a
particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (Does AD
need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights
in order to cater for group membership allows and denies.

I guess I'm just trying to understand the architectural
flow/storage/retrieval of data in the various parts of the system, but I
admit, I need to do more research on this.
After our product release, when I get a few more spare cycles, I can look at
it in more detail.

Many thanks!
Peter



On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com> wrote:

>  Hi Peter,
>
> I just committed the promised changes to the LCF Solr output connector.
>
> ACL metadata will now be posted to the Solr Http interface along with the
> document as the two following fields:
>
> __ACCESS_TOKEN__document
> __DENY_TOKEN__document
>
> There will, of course, potentially be multiple values for each of these two
> fields.
>
> Hope this helps,
> Karl
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Tuesday, April 20, 2010 6:51 PM
>
> *To:* connectors-user@incubator.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks for the info. I'll have a look at the link and try to take in as
> much sugar as my insulin levels will handle...
> It sounds like the necessary interface(s) are already in LCF - just a
> matter of implementing them in the Solr 1872 plugin.
> I'll need to digest the LCF stuff to get to grips with it..please bear with
> me while I do that...
>
> When you say:
>    The LCF solr output connection doesn't yet do this, but it is trivial
> for me to make that happen.
> Do you mean a mechanism by which solr.war can get url et al info from its
> parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>
>
> Thanks,
> Peter
>
>
>
>
> On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
>
>>  Hi Peter,
>>
>> I'm the principal committer for LCF, but I don't know as much about Solr
>> as I ought to, so it sounds like a potentially productive collaboration.
>>
>> LCF does exactly what you are looking for - the only issue at all is that
>> you need to fetch a URL from a webapp to get what you are looking for.  The
>> "plugs" are all inside LCF for different kinds of repositories.  Here's a
>> link that might help with drinking the LCF "koolaid", as it were:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>>
>> The url would be something like this (on a locally installed tomcat-based
>> LCF instance):
>>
>>
>> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>>
>> ... and this fetch returns something like:
>>
>> TOKEN:xxxxxxx
>> TOKEN:yyyyyyy
>> TOKEN:zzzzzzz
>> ....
>>
>> ... which represent the amalgamated tokens for all of the defined
>> authorities, and by some strange coincidence ( ;-) ) are compatible
>> with certain pieces of metadata that have been passed into Solr with each
>> document - one set of Allow tokens, and a second set of Deny tokens.  The
>> LCF solr output connection doesn't yet do this, but it is trivial for me to
>> make that happen.
>>
>> Does this sound plausible to you?
>>
>> Karl
>>
>>
>>  ------------------------------
>>  *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> *Sent:* Tuesday, April 20, 2010 5:41 PM
>> *To:* connectors-user@incubator.apache.org; dev@lucene.apache.org
>>
>> *Subject:* Re: FW: Solr and LCF security at query time
>>
>>   Hi Karl,
>>
>> Integrating LCF to get external token support for SOLR-1872 sounds very
>> interesting indeed. I don't know anything about LCF, but one of the things I
>> was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
>> 'pluggable' - i.e. it would just be one of a series of plugins that could be
>> used for obtaining back-end authentication information.
>>
>> If you're good with LCF, perhaps we could work together to build this in.
>> One of the first things would be defining an interface that would be as easy
>> as possible to plug LCF into. Have you any suggestions/insight on this
>> front?
>>
>> Many thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
>>
>>>  SOLR-1872 looks exactly like what I was envisioning, from the search
>>> query perspective, although instead of the acl.xml file you specify, LCF
>>> stipulates that you would dynamically query the lcf-authority-service
>>> servlet for the access tokens themselves.  That would get you support for
>>> AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that
>>> this component could be modified to work with LCF with minor effort.
>>>
>>> The missing component still seems to be AD authentication, which needs a
>>> solution.
>>>
>>> Karl
>>>
>>>  ------------------------------
>>> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>>> *Sent:* Tuesday, April 20, 2010 10:44 AM
>>> *To:* dev@lucene.apache.org
>>> *Subject:* Re: FW: Solr and LCF security at query time
>>>
>>>   If you want to do this completely within Solr, have a look at:
>>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>>
>>> On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
>>>
>>>>  FYI
>>>>
>>>>  ------------------------------
>>>> *From:* Wright Karl (Nokia-S/Cambridge)
>>>> *Sent:* Tuesday, April 20, 2010 8:16 AM
>>>> *To:* 'dominique.bejean@eolya.fr'
>>>> *Cc:* 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; '
>>>> connectors-user@incubator.apache.org'
>>>> *Subject:* RE: Solr and LCF security at query time
>>>>
>>>>   Dominique,
>>>>
>>>> Yes, I am aware of this ticket and contribution.  Luckily LCF
>>>> establishes a powerful multi-repository security model, even though it
>>>> doesn't yet do the final step of enforcing that model at the search end.
>>>> LCF allows you to define multiple authorities to operate against disparate
>>>> repositories, and use the appropriate authority to secure any given
>>>> document.  The solr people are aware of this design, which addresses the
>>>> issues raised by SOLR-1834 very nicely.  However, as I said before, time is
>>>> a problem, and the work still needs to be done.
>>>>
>>>> I suggest you read up on the actual security model of LCF, and perhaps
>>>> experiment with that and the SOLR-1834 contribution, to see if there is
>>>> common ground.  One thing we've learned at MetaCarta is that post-filtering
>>>> for security purposes is expensive, and it is better to modify the queries
>>>> themselves to restrict the results, if possible.  I'm not sure which
>>>> approach SOLR-1834 takes, although it sounds like it might be the filtering
>>>> approach.  Still, it would be better than nothing.
>>>>
>>>> Please let me know what you find out.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>  ------------------------------
>>>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>>>> *Sent:* Tuesday, April 20, 2010 8:03 AM
>>>> *To:* Wright Karl (Nokia-S/Cambridge)
>>>> *Cc:* connectors-user@incubator.apache.org;
>>>> connectors-dev@incubator.apache.org
>>>> *Subject:* Re: Solr and LCF security at query time
>>>>
>>>> Karl,
>>>>
>>>> Thank you for your reply.
>>>>
>>>> I did some research today and found this:
>>>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>>>> http://demo.findwise.se:8880/SolrSecurity/
>>>>
>>>> The Solr security model has to be able to filter a result list whose items
>>>> come from various sources at the same time (livelink, documentum, file
>>>> system, ...). Big subject :)
>>>>
>>>> Dominique
>>>>
>>>>
>>>> On 20/04/10 13:34, karl.wright@nokia.com wrote:
>>>>
>>>> Hi Dominique,
>>>>
>>>> At the moment, in order to enforce the LCF security model within
>>>> Lucene/Solr, you will need to build this functionality into whatever client
>>>> you are using to display the Lucene search results.  Specifically, you would
>>>> need to take the following steps:
>>>>
>>>> (1) Have your users access your search client through Apache.
>>>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>>>> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
>>>> the client webapp.
>>>> (3) Have your client webapp alter whatever queries it is doing, to add
>>>> an appropriate query clause for each of the access tokens transmitted in the
>>>> headers.
>>>>
>>>> (This is how it is done at MetaCarta.)
>>>>
>>>> Alternatively, you may find a way to do this completely with a web
>>>> application under a Java app server such as Tomcat.  I have not yet done the
>>>> research to find out whether this is a feasible alternative.  Effectively,
>>>> what you need something like mod_auth_kerb to do is to authenticate your
>>>> user against Active Directory, or whomever the authenticator ought to be.
>>>> JAAS may be helpful here.
>>>>
>>>> There are, of course, intentions to fill out the missing pieces more
>>>> completely and transparently via a Solr search plugin and/or filter.  What
>>>> has been lacking is time.  If you are in a position to do development in
>>>> this area, we're happy to have any assistance you might provide.
>>>>
>>>> Thanks,
>>>> Karl
>>>>  ------------------------------
>>>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>>>>
>>>> *Sent:* Tuesday, April 20, 2010 5:06 AM
>>>> *To:* connectors-user@incubator.apache.org
>>>> *Subject:* Solr and LCF security at query time
>>>>
>>>> Hi,
>>>>
>>>> I don't see in the LCF wiki how Solr and LCF work together at query time
>>>> in order to remove from the result list the items the user is not allowed
>>>> to access.
>>>>
>>>> In
>>>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>>>> I just see these sentences :
>>>>
>>>> " Once all these documents and their access tokens are handed to the
>>>> search engine, it is the search engine's job to enforce security by
>>>> excluding inappropriate documents from the search results. For *Lucene*,
>>>> this infrastructure is expected to be built on top of Lucene's generic
>>>> metadata abilities, but has not been implemented at this time."
>>>>
>>>> I am not sure I understand. Does this mean that, for the moment, it is
>>>> not possible for Solr to apply security by using an Authority Connector?
>>>>
>>>> Dominique
>>>>
>>>>
>>>
>>
>

Re: Solr and LCF security at query time

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
OpenAM could have some interesting stuff: http://forgerock.com/openam.html

Jan

On 4 Oct 2010, at 10:48, <ka...@nokia.com> wrote:

> Is there Kerberos support in this offering?  That's what's missing.  LDAP support is actually built into java, and the Active Directory authority makes use of it.  So all we need is the authentication piece.
> 
> Karl
> 
> ________________________________________
> From: ext Lance Norskog [goksron@gmail.com]
> Sent: Sunday, October 03, 2010 10:58 PM
> To: dev@lucene.apache.org
> Subject: Re: FW: Solr and LCF security at query time
> 
> www.openldap.org
> 
> Haven't used it. Here's the License:
> 
> http://www.openldap.org/software/release/license.html
> 
> karl.wright@nokia.com wrote:
>> Looking around for non-Apache, Java-only solutions to the AD authentication problem, it seems to me that what we mainly have available is JAAS plus the following JAAS login module:
>> 
>> com.sun.security.auth.module.Krb5LoginModule
>> 
>> ... which should permit AD authentication to take place,  if properly configured.
>> So, we *could* stipulate that the search component receive credentials, somehow, upon being called, and then authenticate each time.  (There's a ticket cache involved, so this is not as insane as it sounds).
>> 
>> But this architecture option makes me twitchy because I am unclear how exactly this would help Tomcat interact with the browser to manage signon for a web application.  So it might be better to push the authentication itself upstream into a module meant to be plugged into Tomcat, and have Solr just receive and deal with the resulting ticket, and/or an authenticated, domain-qualified user name.  The task of the LCF Solr search component or filter would then be to do the following:
>> 
>> (1) Get hold of the ticket/authenticated user name, which will probably come in as some attribute to the search that's presented to Solr.  (Someone needs to specify what this attribute is called still).
>> (2) Invoke a configured LCF authority service with that user name, via http, and get back a list of access tokens for the user
>> (3) Form the search expression with the user's access tokens (if it's a search component), or filter the results using those access tokens (if it's a filter), remembering that every document that's participating in security should have __ACCESS_TOKEN__document and __DENY_TOKEN__document metadata
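[Editorial sketch, not part of the original message.] Step (3), in its search-component form, could look roughly like the following Python. It assumes the access tokens have already been obtained in step (2), and that a document is visible when at least one of the user's tokens appears in __ACCESS_TOKEN__document and none appears in __DENY_TOKEN__document. That rule, the function names, and the choice to reject an empty token list are assumptions made for illustration; the thread does not specify them. Note that documents carrying no ACL metadata at all would be excluded by this filter.

```python
import re

ALLOW_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"

# Characters that have special meaning in the Lucene/Solr query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')


def escape_term(value):
    """Backslash-escape query-syntax characters in a token value."""
    return _SPECIAL.sub(r"\\\1", value)


def build_security_filter(tokens):
    """Build a filter expression from a user's access tokens.

    Hypothetical helper: requires a match on at least one allow token
    and prohibits a match on any deny token.
    """
    if not tokens:
        # Assumption: the caller decides what a user with no tokens may see.
        raise ValueError("no access tokens supplied")
    escaped = [escape_term(t) for t in tokens]
    allow = " OR ".join(f"{ALLOW_FIELD}:{t}" for t in escaped)
    deny = " ".join(f"-{DENY_FIELD}:{t}" for t in escaped)
    return f"+({allow}) {deny}"
```

The resulting string could be attached to the user's query as an additional restriction, which is the query-modification approach rather than post-filtering.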
>> 
>> I've also been pondering which we should build: a search component or a filter.  I think there are advantages to both, so we should build both, and let people use what they need.
>> 
>> I think the technical aspects of building the Solr component are well understood by this group, so the only open issue remains how to build a JAAS-based AD authentication module for tomcat that would do what we needed.  I'll be doing more research as time permits...
>> 
>> Karl
>> 
>> ________________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Wednesday, April 21, 2010 8:02 PM
>> To: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
>> Subject: RE: FW: Solr and LCF security at query time
>> 
>> Hi Peter,
>> 
>> I just committed the promised changes to the LCF Solr output connector.
>> 
>> ACL metadata will now be posted to the Solr Http interface along with the document as the two following fields:
>> 
>> __ACCESS_TOKEN__document
>> __DENY_TOKEN__document
>> 
>> There will, of course, potentially be multiple values for each of these two fields.
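[Editorial sketch, not part of the original message.] To illustrate how the two multi-valued fields might be consumed by the filter variant, here is a small Python example that checks already-retrieved documents against a user's tokens. Documents are modelled as plain dictionaries, and the visibility rule (at least one allow token matches, no deny token matches) is an assumption for illustration; the function names are hypothetical.

```python
ALLOW_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def _as_set(value):
    """Normalise a missing, single, or multi-valued field to a set."""
    if value is None:
        return set()
    if isinstance(value, str):
        return {value}
    return set(value)


def is_visible(doc, user_tokens):
    """Return True if the user may see the document (assumed rule)."""
    user = set(user_tokens)
    allow = _as_set(doc.get(ALLOW_FIELD))
    deny = _as_set(doc.get(DENY_FIELD))
    return bool(allow & user) and not (deny & user)


def filter_results(docs, user_tokens):
    """Keep only the documents the user is allowed to see."""
    return [d for d in docs if is_visible(d, user_tokens)]
```

Under this rule a document with no allow tokens is never shown, which may or may not be the desired behaviour for unsecured content.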
>> 
>> Hope this helps,
>> Karl
>> 
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> Sent: Tuesday, April 20, 2010 6:51 PM
>> To: connectors-user@incubator.apache.org
>> Subject: Re: FW: Solr and LCF security at query time
>> 
>> Hi Karl,
>> 
>> Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
>> It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the Solr 1872 plugin.
>> I'll need to digest the LCF stuff to get to grips with it..please bear with me while I do that...
>> 
>> When you say:
>>    The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
>> Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>> 
>> 
>> Thanks,
>> Peter
>> 
>> 
>> 
>> 
>> On Tue, Apr 20, 2010 at 11:05 PM,<ka...@nokia.com>>  wrote:
>> Hi Peter,
>> 
>> I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.
>> 
>> LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>> 
>> The url would be something like this (on a locally installed tomcat-based LCF instance):
>> 
>> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>> 
>> ... and this fetch returns something like:
>> 
>> TOKEN:xxxxxxx
>> TOKEN:yyyyyyy
>> TOKEN:zzzzzzz
>> ....
>> 
>> ... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
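[Editorial sketch, not part of the original message.] The fetch described above could be written in Python roughly as follows, assuming the authority service answers with plain text containing one TOKEN: line per access token, as in the sample. The function names, the default base URL (copied from the example), the timeout, and the decision to ignore lines that do not start with TOKEN: are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL, taken from the example (local Tomcat-based LCF instance).
AUTHORITY_URL = "http://localhost:8080/lcf-authority-service/UserACLs"


def build_acl_url(username, base_url=AUTHORITY_URL):
    """Return the UserACLs URL for a user, with the name URL-encoded."""
    return base_url + "?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract tokens from a response with one 'TOKEN:value' per line.

    Lines without the 'TOKEN:' prefix are skipped (an assumption; the
    thread does not say what else the service may return).
    """
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix) and len(line) > len(prefix):
            tokens.append(line[len(prefix):])
    return tokens


def fetch_tokens(username, base_url=AUTHORITY_URL, timeout=10):
    """Fetch and parse the access tokens for a user over HTTP."""
    with urlopen(build_acl_url(username, base_url), timeout=timeout) as resp:
        return parse_tokens(resp.read().decode("utf-8"))
```

Keeping the parsing separate from the HTTP call makes it possible to test the token handling without a running authority service.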
>> 
>> Does this sound plausible to you?
>> 
>> Karl
>> 
>> 
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 5:41 PM
>> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; dev@lucene.apache.org<ma...@lucene.apache.org>
>> 
>> Subject: Re: FW: Solr and LCF security at query time
>> 
>> Hi Karl,
>> 
>> Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.
>> 
>> If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?
>> 
>> Many thanks,
>> Peter
>> 
>> 
>> 
>> On Tue, Apr 20, 2010 at 4:08 PM,<ka...@nokia.com>>  wrote:
>> SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.
>> 
>> The missing component still seems to be AD authentication, which needs a solution.
>> 
>> Karl
>> 
>> ________________________________
>> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
>> Sent: Tuesday, April 20, 2010 10:44 AM
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: FW: Solr and LCF security at query time
>> 
>> If you want to do this completely within Solr, have a look at:
>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>> 
>> Thanks,
>> Peter
>> 
>> 
>> 
>> On Tue, Apr 20, 2010 at 1:25 PM,<ka...@nokia.com>>  wrote:
>> FYI
>> 
>> ________________________________
>> From: Wright Karl (Nokia-S/Cambridge)
>> Sent: Tuesday, April 20, 2010 8:16 AM
>> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
>> Cc: 'solr-dev@apache.org<ma...@apache.org>'; 'connectors-dev@incubator.apache.org<ma...@incubator.apache.org>'; 'connectors-user@incubator.apache.org<ma...@incubator.apache.org>'
>> Subject: RE: Solr and LCF security at query time
>> 
>> Dominique,
>> 
>> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.
>> 
>> I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.
>> 
>> Please let me know what you find out.
>> 
>> Thanks,
>> Karl
>> 
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<ma...@eolya.fr>]
>> Sent: Tuesday, April 20, 2010 8:03 AM
>> To: Wright Karl (Nokia-S/Cambridge)
>> Cc: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; connectors-dev@incubator.apache.org<ma...@incubator.apache.org>
>> Subject: Re: Solr and LCF security at query time
>> 
>> Karl,
>> 
>> Thank you for your reply.
>> 
>> I did some research today and found this:
>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> http://demo.findwise.se:8880/SolrSecurity/
>> 
>> The Solr security model has to be able to filter a result list with items coming from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)
>> 
>> Dominique
>> 
>> 
>> On 20/04/10 13:34, karl.wright@nokia.com<ma...@nokia.com> wrote:
>> Hi Dominique,
>> 
>> At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:
>> 
>> (1) Have your users access your search client through Apache.
>> (2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
>> (3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.
>> 
>> (This is how it is done at MetaCarta.)
>> 
>> Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whomever the authenticator ought to be.  JAAS may be helpful here.
>> 
>> There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.
>> 
>> Thanks,
>> Karl
>> ________________________________
>> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> Sent: Tuesday, April 20, 2010 5:06 AM
>> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org>
>> Subject: Solr and LCF security at query time
>> 
>> Hi,
>> 
>> I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.
>> 
>> In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences :
>> 
>> " Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."
>> 
>> I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?
>> 
>> Dominique
>> 
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Is there Kerberos support in this offering?  That's what's missing.  LDAP support is actually built into java, and the Active Directory authority makes use of it.  So all we need is the authentication piece.

Karl

________________________________________
From: ext Lance Norskog [goksron@gmail.com]
Sent: Sunday, October 03, 2010 10:58 PM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

www.openldap.org

Haven't used it. Here's the License:

http://www.openldap.org/software/release/license.html

karl.wright@nokia.com wrote:
> Looking around for non-Apache, Java-only solutions to the AD authentication problem, it seems to me that what we mainly have available is JAAS plus the following JAAS login module:
>
> com.sun.security.auth.module.Krb5LoginModule
>
> ... which should permit AD authentication to take place,  if properly configured.
> So, we *could* stipulate that the search component receive credentials, somehow, upon being called, and then authenticate each time.  (There's a ticket cache involved, so this is not as insane as it sounds).
>
> But this architecture option makes me twitchy because I am unclear how exactly this would help Tomcat interact with the browser to manage signon for a web application.  So it might be better to push the authentication itself upstream into a module meant to be plugged into Tomcat, and have Solr just receive and deal with the resulting ticket, and/or an authenticated, domain-qualified user name.  The task of the LCF Solr search component or filter would then be to do the following:
>
> (1) Get hold of the ticket/authenticated user name, which will probably come in as some attribute to the search that's presented to Solr.  (Someone needs to specify what this attribute is called still).
> (2) Invoke a configured LCF authority service with that user name, via http, and get back a list of access tokens for the user
> (3) Form the search expression with the user's access tokens (if it's a search component), or filter the results using those access tokens (if it's a filter), remembering that every document that's participating in security should have __ACCESS_TOKEN__document and __DENY_TOKEN__document metadata
>
> I've also been pondering which we should build: a search component or a filter.  I think there are advantages to both, so we should build both, and let people use what they need.
>
> I think the technical aspects of building the Solr component are well understood by this group, so the only open issue remains how to build a JAAS-based AD authentication module for tomcat that would do what we needed.  I'll be doing more research as time permits...
>
> Karl
>
> ________________________________________
> From: Wright Karl (Nokia-S/Cambridge)
> Sent: Wednesday, April 21, 2010 8:02 PM
> To: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
> Subject: RE: FW: Solr and LCF security at query time
>
> Hi Peter,
>
> I just committed the promised changes to the LCF Solr output connector.
>
> ACL metadata will now be posted to the Solr Http interface along with the document as the two following fields:
>
> __ACCESS_TOKEN__document
> __DENY_TOKEN__document
>
> There will, of course, potentially be multiple values for each of these two fields.
>
> Hope this helps,
> Karl
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Tuesday, April 20, 2010 6:51 PM
> To: connectors-user@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
> It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the Solr 1872 plugin.
> I'll need to digest the LCF stuff to get to grips with it..please bear with me while I do that...
>
> When you say:
>     The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
> Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>
>
> Thanks,
> Peter
>
>
>
>
> On Tue, Apr 20, 2010 at 11:05 PM,<ka...@nokia.com>>  wrote:
> Hi Peter,
>
> I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.
>
> LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>
> The url would be something like this (on a locally installed tomcat-based LCF instance):
>
> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>
> ... and this fetch returns something like:
>
> TOKEN:xxxxxxx
> TOKEN:yyyyyyy
> TOKEN:zzzzzzz
> ....
>
> ... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
>
> Does this sound plausible to you?
>
> Karl
>
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
> Sent: Tuesday, April 20, 2010 5:41 PM
> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; dev@lucene.apache.org<ma...@lucene.apache.org>
>
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.
>
> If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?
>
> Many thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 4:08 PM,<ka...@nokia.com>>  wrote:
> SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.
>
> The missing component still seems to be AD authentication, which needs a solution.
>
> Karl
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
> Sent: Tuesday, April 20, 2010 10:44 AM
> To: dev@lucene.apache.org<ma...@lucene.apache.org>
> Subject: Re: FW: Solr and LCF security at query time
>
> If you want to do this completely within Solr, have a look at:
> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>
> Thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 1:25 PM,<ka...@nokia.com>>  wrote:
> FYI
>
> ________________________________
> From: Wright Karl (Nokia-S/Cambridge)
> Sent: Tuesday, April 20, 2010 8:16 AM
> To: 'dominique.bejean@eolya.fr<ma...@eolya.fr>'
> Cc: 'solr-dev@apache.org<ma...@apache.org>'; 'connectors-dev@incubator.apache.org<ma...@incubator.apache.org>'; 'connectors-user@incubator.apache.org<ma...@incubator.apache.org>'
> Subject: RE: Solr and LCF security at query time
>
> Dominique,
>
> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.
>
> I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.
>
> Please let me know what you find out.
>
> Thanks,
> Karl
>
> ________________________________
> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr<ma...@eolya.fr>]
> Sent: Tuesday, April 20, 2010 8:03 AM
> To: Wright Karl (Nokia-S/Cambridge)
> Cc: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; connectors-dev@incubator.apache.org<ma...@incubator.apache.org>
> Subject: Re: Solr and LCF security at query time
>
> Karl,
>
> Thank you for your reply.
>
> I did some research today and found this:
> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> http://demo.findwise.se:8880/SolrSecurity/
>
> The Solr security model has to be able to filter a result list with items coming from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)
>
> Dominique
>
>
> On 20/04/10 13:34, karl.wright@nokia.com<ma...@nokia.com> wrote:
> Hi Dominique,
>
> At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:
>
> (1) Have your users access your search client through Apache.
> (2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
> (3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.
>
> (This is how it is done at MetaCarta.)
>
> Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whomever the authenticator ought to be.  JAAS may be helpful here.
>
> There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.
>
> Thanks,
> Karl
> ________________________________
> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> Sent: Tuesday, April 20, 2010 5:06 AM
> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org>
> Subject: Solr and LCF security at query time
>
> Hi,
>
> I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.
>
> In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences :
>
> " Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."
>
> I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?
>
> Dominique
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org



Re: FW: Solr and LCF security at query time

Posted by Lance Norskog <go...@gmail.com>.
www.openldap.org

Haven't used it. Here's the License:

http://www.openldap.org/software/release/license.html

karl.wright@nokia.com wrote:
> Looking around for non-Apache, Java-only solutions to the AD authentication problem, it seems to me that what we mainly have available is JAAS plus the following JAAS login module:
>
> com.sun.security.auth.module.Krb5LoginModule
>
> ... which should permit AD authentication to take place,  if properly configured.
> So, we *could* stipulate that the search component receive credentials, somehow, upon being called, and then authenticate each time.  (There's a ticket cache involved, so this is not as insane as it sounds).
>
> But this architecture option makes me twitchy because I am unclear how exactly this would help Tomcat interact with the browser to manage signon for a web application.  So it might be better to push the authentication itself upstream into a module meant to be plugged into Tomcat, and have Solr just receive and deal with the resulting ticket, and/or an authenticated, domain-qualified user name.  The task of the LCF Solr search component or filter would then be to do the following:
>
> (1) Get hold of the ticket/authenticated user name, which will probably come in as some attribute to the search that's presented to Solr.  (Someone needs to specify what this attribute is called still).
> (2) Invoke a configured LCF authority service with that user name, via http, and get back a list of access tokens for the user
> (3) Form the search expression with the user's access tokens (if it's a search component), or filter the results using those access tokens (if it's a filter), remembering that every document that's participating in security should have __ACCESS_TOKEN__document and __DENY_TOKEN__document metadata
>
> I've also been pondering which we should build: a search component or a filter.  I think there are advantages to both, so we should build both, and let people use what they need.
>
> I think the technical aspects of building the Solr component are well understood by this group, so the only open issue remains how to build a JAAS-based AD authentication module for tomcat that would do what we needed.  I'll be doing more research as time permits...
>
> Karl
>
> ________________________________________
> From: Wright Karl (Nokia-S/Cambridge)
> Sent: Wednesday, April 21, 2010 8:02 PM
> To: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
> Subject: RE: FW: Solr and LCF security at query time
>
> Hi Peter,
>
> I just committed the promised changes to the LCF Solr output connector.
>
> ACL metadata will now be posted to the Solr Http interface along with the document as the two following fields:
>
> __ACCESS_TOKEN__document
> __DENY_TOKEN__document
>
> There will, of course, potentially be multiple values for each of these two fields.
>
> Hope this helps,
> Karl
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Tuesday, April 20, 2010 6:51 PM
> To: connectors-user@incubator.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
> It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the Solr 1872 plugin.
> I'll need to digest the LCF stuff to get to grips with it..please bear with me while I do that...
>
> When you say:
>     The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
> Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>
>
> Thanks,
> Peter
>
>
>
>
> On Tue, Apr 20, 2010 at 11:05 PM,<ka...@nokia.com>>  wrote:
> Hi Peter,
>
> I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.
>
> LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>
> The url would be something like this (on a locally installed tomcat-based LCF instance):
>
> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>
> ... and this fetch returns something like:
>
> TOKEN:xxxxxxx
> TOKEN:yyyyyyy
> TOKEN:zzzzzzz
> ....
>
> ... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
>
> Does this sound plausible to you?
>
> Karl
>
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com<ma...@googlemail.com>]
> Sent: Tuesday, April 20, 2010 5:41 PM
> To: connectors-user@incubator.apache.org<ma...@incubator.apache.org>; dev@lucene.apache.org<ma...@lucene.apache.org>
>
> Subject: Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.
>
> If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?
>
> Many thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
> SOLR-1872 looks exactly like what I was envisioning from the search query perspective, although instead of the acl.xml file you specify, LCF stipulates that you dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.
>
> The missing component still seems to be AD authentication, which needs a solution.
>
> Karl
>
> ________________________________
> From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> Sent: Tuesday, April 20, 2010 10:44 AM
> To: dev@lucene.apache.org
> Subject: Re: FW: Solr and LCF security at query time
>
> If you want to do this completely within Solr, have a look at:
> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>
> Thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
> FYI
>
> ________________________________
> From: Wright Karl (Nokia-S/Cambridge)
> Sent: Tuesday, April 20, 2010 8:16 AM
> To: 'dominique.bejean@eolya.fr'
> Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
> Subject: RE: Solr and LCF security at query time
>
> Dominique,
>
> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.
>
> I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.
>
> Please let me know what you find out.
>
> Thanks,
> Karl
>
> ________________________________
> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> Sent: Tuesday, April 20, 2010 8:03 AM
> To: Wright Karl (Nokia-S/Cambridge)
> Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
> Subject: Re: Solr and LCF security at query time
>
> Karl,
>
> Thank you for your reply.
>
> I did some research today and found this:
> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> http://demo.findwise.se:8880/SolrSecurity/
>
> The Solr security model has to be able to filter a result list containing items that come from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)
>
> Dominique
>
>
> On 20/04/10 13:34, karl.wright@nokia.com wrote:
> Hi Dominique,
>
> At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:
>
> (1) Have your users access your search client through Apache.
> (2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
> (3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.
>
> (This is how it is done at MetaCarta.)
>
> Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, you need something like mod_auth_kerb to authenticate your user against Active Directory, or whatever the authenticator ought to be.  JAAS may be helpful here.
>
> There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.
>
> Thanks,
> Karl
> ________________________________
> From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> Sent: Tuesday, April 20, 2010 5:06 AM
> To: connectors-user@incubator.apache.org
> Subject: Solr and LCF security at query time
>
> Hi,
>
> I don't see in the LCF wiki how Solr and LCF work together at query time to remove from the result list the items the user is not allowed to access.
>
> In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences:
>
> "Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."
>
> I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?
>
> Dominique
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>    


RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Looking around for non-Apache, Java-only solutions to the AD authentication problem, it seems to me that what we mainly have available is JAAS plus the following JAAS login module:

com.sun.security.auth.module.Krb5LoginModule

... which should permit AD authentication to take place, if properly configured.
So, we *could* stipulate that the search component receive credentials, somehow, upon being called, and then authenticate each time.  (There's a ticket cache involved, so this is not as insane as it sounds.)

But this architecture option makes me twitchy because I am unclear how exactly this would help Tomcat interact with the browser to manage signon for a web application.  So it might be better to push the authentication itself upstream into a module meant to be plugged into Tomcat, and have Solr just receive and deal with the resulting ticket, and/or an authenticated, domain-qualified user name.  The task of the LCF Solr search component or filter would then be to do the following:

(1) Get hold of the ticket/authenticated user name, which will probably come in as some attribute of the search that's presented to Solr.  (Someone still needs to specify what this attribute is called.)
(2) Invoke a configured LCF authority service with that user name, via HTTP, and get back a list of access tokens for the user.
(3) Form the search expression with the user's access tokens (if it's a search component), or filter the results using those access tokens (if it's a filter), remembering that every document that's participating in security should have __ACCESS_TOKEN__document and __DENY_TOKEN__document metadata.
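
As a rough illustration of the search-component side of step (3), here is a minimal Python sketch that turns a user's access tokens into a single Solr filter-query string. The two field names come from this thread; the function names, the quoting helper, and the matching rule (at least one allow token must match and no deny token may match) are assumptions made for illustration, not the behaviour of any existing LCF or Solr component.

```python
def quote_token(token):
    """Quote a token as a Lucene phrase, escaping backslashes and double quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_security_filter(tokens,
                          allow_field="__ACCESS_TOKEN__document",
                          deny_field="__DENY_TOKEN__document"):
    """Build a filter-query string from the user's access tokens.

    Assumed semantics (not confirmed in this thread): a document is visible
    when at least one of its allow tokens is held by the user and none of
    its deny tokens is.
    """
    if not tokens:
        # No tokens at all: build a query intended to match nothing.
        return "*:* -*:*"
    clause = " OR ".join(quote_token(t) for t in tokens)
    return f"+{allow_field}:({clause}) -{deny_field}:({clause})"


if __name__ == "__main__":
    # Placeholder tokens in the style of the example response in this thread.
    print(build_security_filter(["xxxxxxx", "yyyyyyy"]))
```

The resulting string could be appended to a request as an extra filter query, so that restricted documents never enter the result set instead of being removed afterwards.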

I've also been pondering which we should build: a search component or a filter?  I think there are advantages to both, so I think we should build both, and let people use what they need.

I think the technical aspects of building the Solr component are well understood by this group, so the only remaining open issue is how to build a JAAS-based AD authentication module for Tomcat that does what we need.  I'll be doing more research as time permits...

Karl

________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Wednesday, April 21, 2010 8:02 PM
To: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I just committed the promised changes to the LCF Solr output connector.

ACL metadata will now be posted to the Solr HTTP interface along with the document as the following two fields:

__ACCESS_TOKEN__document
__DENY_TOKEN__document

There will, of course, potentially be multiple values for each of these two fields.
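
To make the multi-valued fields concrete, here is a small Python sketch of the filter-style alternative: checking already-retrieved documents against a user's tokens. The field names are the two above; the document dictionaries, the helper names, and the rule that a document with no matching allow token is hidden are assumptions for illustration only.

```python
ALLOW_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def is_visible(doc, user_tokens):
    """Return True when the user holds an allow token and no deny token.

    Both fields are treated as multi-valued; a missing field is treated
    as an empty list (an assumption, not confirmed LCF behaviour).
    """
    user = set(user_tokens)
    allow = set(doc.get(ALLOW_FIELD, []))
    deny = set(doc.get(DENY_FIELD, []))
    return bool(allow & user) and not (deny & user)


def post_filter(docs, user_tokens):
    """Keep only the documents the user is allowed to see."""
    return [d for d in docs if is_visible(d, user_tokens)]


# Hypothetical documents carrying the two ACL fields.
docs = [
    {"id": "1", ALLOW_FIELD: ["xxxxxxx"], DENY_FIELD: []},
    {"id": "2", ALLOW_FIELD: ["xxxxxxx", "yyyyyyy"], DENY_FIELD: ["zzzzzzz"]},
    {"id": "3", ALLOW_FIELD: ["qqqqqqq"]},
]

if __name__ == "__main__":
    visible = post_filter(docs, ["xxxxxxx", "zzzzzzz"])
    print([d["id"] for d in visible])  # prints ['1']
```

Document 2 is hidden here because the user holds one of its deny tokens, even though an allow token also matches.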

Hope this helps,
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 6:51 PM
To: connectors-user@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF, so it's just a matter of implementing them in the SOLR-1872 plugin.
I'll need to digest the LCF stuff to get to grips with it... please bear with me while I do that...

When you say:
   The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
Do you mean a mechanism by which solr.war can get the URL and related info from its parent container (Tomcat, Jetty, etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The URL would be something like this (on a locally installed Tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document: one set of Allow tokens, and a second set of Deny tokens.  The LCF Solr output connection doesn't yet do this, but it is trivial for me to make that happen.
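
For anyone who wants to try the fetch described above, here is a minimal Python sketch of a client. The servlet path and the "TOKEN:" line prefix are taken from the example in this message; the function names are hypothetical, and ignoring any line that does not start with "TOKEN:" is an assumption, since the full response format is not spelled out here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Default taken from the example URL above; adjust for a real installation.
DEFAULT_BASE = "http://localhost:8080/lcf-authority-service/UserACLs"


def authority_url(username, base=DEFAULT_BASE):
    """Build the authority-service URL, URL-encoding the user name."""
    return f"{base}?{urlencode({'username': username})}"


def parse_tokens(body):
    """Extract access tokens from a response body of 'TOKEN:...' lines.

    Lines without the 'TOKEN:' prefix are skipped (an assumption).
    """
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


def fetch_tokens(username, base=DEFAULT_BASE, timeout=10):
    """Fetch and parse the tokens for a user (requires a running service)."""
    with urlopen(authority_url(username, base), timeout=timeout) as resp:
        return parse_tokens(resp.read().decode("utf-8"))


if __name__ == "__main__":
    sample = "TOKEN:xxxxxxx\nTOKEN:yyyyyyy\nTOKEN:zzzzzzz\n"
    print(authority_url("someusername@somedomain.com"))
    print(parse_tokens(sample))
```

Only `fetch_tokens` touches the network; the other two functions can be exercised without an LCF instance.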

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning from the search query perspective, although instead of the acl.xml file you specify, LCF stipulates that you dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list containing items that come from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, you need something like mod_auth_kerb to authenticate your user against Active Directory, or whatever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences:

"Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique





RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Looking around for non-Apache, Java-only solutions to the AD authentication problem, it seems to me that what we mainly have available is JAAS plus the following JAAS login module:

com.sun.security.auth.module.Krb5LoginModule

... which should permit AD authentication to take place, if properly configured.
So, we *could* stipulate that the search component receive credentials, somehow, upon being called, and then authenticate each time.  (There's a ticket cache involved, so this is not as insane as it sounds.)

But this architecture option makes me twitchy because I am unclear how exactly this would help Tomcat interact with the browser to manage signon for a web application.  So it might be better to push the authentication itself upstream into a module meant to be plugged into Tomcat, and have Solr just receive and deal with the resulting ticket, and/or an authenticated, domain-qualified user name.  The task of the LCF Solr search component or filter would then be to do the following:

(1) Get hold of the ticket/authenticated user name, which will probably come in as some attribute of the search that's presented to Solr.  (Someone still needs to specify what this attribute is called.)
(2) Invoke a configured LCF authority service with that user name, via HTTP, and get back a list of access tokens for the user.
(3) Form the search expression with the user's access tokens (if it's a search component), or filter the results using those access tokens (if it's a filter), remembering that every document that's participating in security should have __ACCESS_TOKEN__document and __DENY_TOKEN__document metadata.

I've also been pondering which we should build: a search component or a filter?  I think there are advantages to both, so I think we should build both, and let people use what they need.

I think the technical aspects of building the Solr component are well understood by this group, so the only remaining open issue is how to build a JAAS-based AD authentication module for Tomcat that does what we need.  I'll be doing more research as time permits...

Karl

________________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Wednesday, April 21, 2010 8:02 PM
To: connectors-user@incubator.apache.org; lucene-dev@apache.org; connectors-dev@incubator.apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I just committed the promised changes to the LCF Solr output connector.

ACL metadata will now be posted to the Solr Http interface along with the document as the two following fields:

__ACCESS_TOKEN__document
__DENY_TOKEN__document

There will, of course, potentially be multiple values for each of these two fields.
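
To illustrate what "multiple values for each of these two fields" means on the Solr side, here is a minimal Python sketch that builds a document in Solr's XML update format. The thread does not show the wire format the connector actually uses, so this is not a description of the connector; the "id" field name and the build_add_message helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Field names taken from this thread.
ALLOW_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def build_add_message(doc_id: str, allow_tokens: Iterable[str],
                      deny_tokens: Iterable[str]) -> str:
    """Return an <add><doc>...</doc></add> message with repeated token fields."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name: str, value: str) -> None:
        element = ET.SubElement(doc, "field", {"name": name})
        element.text = value

    field("id", doc_id)  # assumed unique-key field name
    # A multi-valued field is expressed by repeating the <field> element.
    for token in allow_tokens:
        field(ALLOW_FIELD, token)
    for token in deny_tokens:
        field(DENY_FIELD, token)
    return ET.tostring(add, encoding="unicode")
```

For this to index as intended, both fields would presumably need to be declared multi-valued in the Solr schema.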

Hope this helps,
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 6:51 PM
To: connectors-user@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the SOLR-1872 plugin.
I'll need to digest the LCF stuff to get to grips with it. Please bear with me while I do that...

When you say:
   The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The url would be something like this (on a locally installed tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
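
A minimal Python sketch of that fetch, assuming the response is UTF-8 text with one "TOKEN:" line per access token as in the example above. The function names are hypothetical, and lines that do not start with "TOKEN:" are simply skipped here, since the thread does not say what else the service may return.

```python
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

TOKEN_PREFIX = "TOKEN:"


def authority_url(base_url: str, username: str) -> str:
    """Build the UserACLs URL, percent-encoding the user name."""
    return f"{base_url}/UserACLs?{urlencode({'username': username})}"


def parse_tokens(body: str) -> List[str]:
    """Extract the token values from a response body."""
    return [
        line[len(TOKEN_PREFIX):].strip()
        for line in body.splitlines()
        if line.startswith(TOKEN_PREFIX)
    ]


def fetch_tokens(base_url: str, username: str, timeout: float = 10.0) -> List[str]:
    """Fetch and parse the access tokens for one user (performs an HTTP GET)."""
    with urlopen(authority_url(base_url, username), timeout=timeout) as response:
        return parse_tokens(response.read().decode("utf-8"))
```

With the local instance described above, the call would be fetch_tokens("http://localhost:8080/lcf-authority-service", "someusername@somedomain.com").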

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org

Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl.xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list containing items that come from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whomever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I see only these sentences:

" Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique




Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Thanks for the quick turnaround.
I'm in the middle of a product release for us, so I fear I won't be as quick
as you... :-)

I couldn't find a simple flow diagram or similar for LCF with regard to
security (probably looking in the wrong place).
Perhaps you could help on these questions...?

In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries,
which are then used as filter queries in a user's search.

Are the ACCESS_TOKEN and DENY_TOKEN values whatever has been stored for a
particular user in the underlying ACL store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (Does AD
need its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights
in order to cater for group membership allows and denies.

I guess I'm just trying to understand the architectural
flow/storage/retrieval of data in the various parts of the system, but I
admit, I need to do more research on this.
After our product release, when I get a few more spare cycles, I can look at
it in more detail.

Many thanks!
Peter



On Thu, Apr 22, 2010 at 1:02 AM, <ka...@nokia.com> wrote:

>  Hi Peter,
>
> I just committed the promised changes to the LCF Solr output connector.
>
> ACL metadata will now be posted to the Solr Http interface along with the
> document as the two following fields:
>
> __ACCESS_TOKEN__document
> __DENY_TOKEN__document
>
> There will, of course, potentially be multiple values for each of these two
> fields.
>
> Hope this helps,
> Karl
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Tuesday, April 20, 2010 6:51 PM
>
> *To:* connectors-user@incubator.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Thanks for the info. I'll have a look at the link and try to take in as
> much sugar as my insulin levels will handle...
> It sounds like the necessary interface(s) are already in LCF - just a
> matter of implementing them in the Solr 1872 plugin.
> I'll need to digest the LCF stuff to get to grips with it..please bear with
> me while I do that...
>
> When you say:
>    The LCF solr output connection doesn't yet do this, but it is trivial
> for me to make that happen.
> Do you mean a mechanism by which solr.war can get url et al info from its
> parent container (Tomcat, Jetty etc.), or have I misinterpreted this?
>
>
> Thanks,
> Peter
>
>
>
>
> On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
>
>>  Hi Peter,
>>
>> I'm the principal committer for LCF, but I don't know as much about Solr
>> as I ought to, so it sounds like a potentially productive collaboration.
>>
>> LCF does exactly what you are looking for - the only issue at all is that
>> you need to fetch a URL from a webapp to get what you are looking for.  The
>> "plugs" are all inside LCF for different kinds of repositories.  Here's a
>> link that might help with drinking the LCF "koolaid", as it were:
>> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>>
>> The url would be something like this (on a locally installed tomcat-based
>> LCF instance):
>>
>>
>> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>>
>> ... and this fetch returns something like:
>>
>> TOKEN:xxxxxxx
>> TOKEN:yyyyyyy
>> TOKEN:zzzzzzz
>> ....
>>
>> ... which represent the amalgamated tokens for all of the defined
>> authorities, and by some strange coincidence ( ;-) ) are compatible
>> with certain pieces of metadata that have been passed into Solr with each
>> document - one set of Allow tokens, and a second set of Deny tokens.  The
>> LCF solr output connection doesn't yet do this, but it is trivial for me to
>> make that happen.
>>
>> Does this sound plausible to you?
>>
>> Karl
>>
>>
>>  ------------------------------
>>  *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> *Sent:* Tuesday, April 20, 2010 5:41 PM
>> *To:* connectors-user@incubator.apache.org; dev@lucene.apache.org
>>
>> *Subject:* Re: FW: Solr and LCF security at query time
>>
>>   Hi Karl,
>>
>> Integrating LCF to get external token support for SOLR-1872 sounds very
>> interesting indeed. I don't know anything about LCF, but one of the things I
>> was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
>> 'pluggable' - i.e. it would just be one of a series of plugins that could be
>> used for obtaining back-end authentication information.
>>
>> If you're good with LCF, perhaps we could work together to build this in.
>> One of the first things would be defining an interface that would be as easy
>> as possible to plug LCF into. Have you any suggestions/insight on this
>> front?
>>
>> Many thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
>>
>>>  SOLR-1872 looks exactly like what I was envisioning, from the search
>>> query perspective, although instead of the acl xml file you specify LCF
>>> stipulates you would dynamically query the lcf-authority-service servlet for
>>> the access tokens themselves.  That would get you support for AD,
>>> Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this
>>> component could be modified to work with LCF with minor effort.
>>>
>>> The missing component still seems to be AD authentication, which needs a
>>> solution.
>>>
>>> Karl
>>>
>>>  ------------------------------
>>> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>>> *Sent:* Tuesday, April 20, 2010 10:44 AM
>>> *To:* dev@lucene.apache.org
>>> *Subject:* Re: FW: Solr and LCF security at query time
>>>
>>>   If you want to do this completely within Solr, have a look at:
>>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>>
>>> On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
>>>
>>>>  FYI
>>>>
>>>>  ------------------------------
>>>> *From:* Wright Karl (Nokia-S/Cambridge)
>>>> *Sent:* Tuesday, April 20, 2010 8:16 AM
>>>> *To:* 'dominique.bejean@eolya.fr'
>>>> *Cc:* 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; '
>>>> connectors-user@incubator.apache.org'
>>>> *Subject:* RE: Solr and LCF security at query time
>>>>
>>>>   Dominique,
>>>>
>>>> Yes, I am aware of this ticket and contribution.  Luckily LCF
>>>> establishes a powerful multi-repository security model, even though it
>>>> doesn't yet do the final step of enforcing that model at the search end.
>>>> LCF allows you to define multiple authorities to operate against disparate
>>>> repositories, and use the appropriate authority to secure any given
>>>> document.  The solr people are aware of this design, which addresses the
>>>> issues raised by SOLR-1834 very nicely.  However, as I said before, time is
>>>> a problem, and the work still needs to be done.
>>>>
>>>> I suggest you read up on the actual security model of LCF, and perhaps
>>>> experiment with that and the SOLR-1834 contribution, to see if there is
>>>> common ground.  One thing we've learned at MetaCarta is that post-filtering
>>>> for security purposes is expensive, and it is better to modify the queries
>>>> themselves to restrict the results, if possible.  I'm not sure which
>>>> approach SOLR-1834 takes, although it sounds like it might be the filtering
>>>> approach.  Still, it would be better than nothing.
>>>>
>>>> Please let me know what you find out.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>  ------------------------------
>>>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>>>> *Sent:* Tuesday, April 20, 2010 8:03 AM
>>>> *To:* Wright Karl (Nokia-S/Cambridge)
>>>> *Cc:* connectors-user@incubator.apache.org;
>>>> connectors-dev@incubator.apache.org
>>>> *Subject:* Re: Solr and LCF security at query time
>>>>
>>>> Karl,
>>>>
>>>> Thank you for your reply.
>>>>
>>>> I did some research today and found this:
>>>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>>>> http://demo.findwise.se:8880/SolrSecurity/
>>>>
>>>> The Solr security model has to be able to filter a result list containing
>>>> items that come from various sources at the same time (LiveLink,
>>>> Documentum, file system, ...). Big subject :)
>>>>
>>>> Dominique
>>>>
>>>>
>>>> On 20/04/10 13:34, karl.wright@nokia.com wrote:
>>>>
>>>> Hi Dominique,
>>>>
>>>> At the moment, in order to enforce the LCF security model within
>>>> Lucene/Solr, you will need to build this functionality into whatever client
>>>> you are using to display the Lucene search results.  Specifically, you would
>>>> need to take the following steps:
>>>>
>>>> (1) Have your users access your search client through Apache.
>>>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>>>> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
>>>> the client webapp.
>>>> (3) Have your client webapp alter whatever queries it is doing, to add
>>>> an appropriate query clause for each of the access tokens transmitted in the
>>>> headers.
>>>>
>>>> (This is how it is done at MetaCarta.)
>>>>
>>>> Alternatively, you may find a way to do this completely with a web
>>>> application under a Java app server such as Tomcat.  I have not yet done the
>>>> research to find out whether this is a feasible alternative.  Effectively,
>>>> what you need something like mod_auth_kerb to do is to authenticate your
>>>> user against Active Directory, or whomever the authenticator ought to be.
>>>> JAAS may be helpful here.
>>>>
>>>> There are, of course, intentions to fill out the missing pieces more
>>>> completely and transparently via a Solr search plugin and/or filter.  What
>>>> has been lacking is time.  If you are in a position to do development in
>>>> this area, we're happy to have any assistance you might provide.
>>>>
>>>> Thanks,
>>>> Karl
>>>>  ------------------------------
>>>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>>>>
>>>> *Sent:* Tuesday, April 20, 2010 5:06 AM
>>>> *To:* connectors-user@incubator.apache.org
>>>> *Subject:* Solr and LCF security at query time
>>>>
>>>> Hi,
>>>>
>>>> I don't see in LCF wiki how Solr and LCF works together at query time in
>>>> order to remove from the result list the items the user is not allowed to
>>>> access.
>>>>
>>>> In
>>>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>>>> I just see these sentences :
>>>>
>>>> " Once all these documents and their access tokens are handed to the
>>>> search engine, it is the search engine's job to enforce security by
>>>> excluding inappropriate documents from the search results. For *Lucene*,
>>>> this infrastructure is expected to be built on top of Lucene's generic
>>>> metadata abilities, but has not been implemented at this time."
>>>>
>>>> I am not sure to understand. Does this mean that for the moment, it is
>>>> not possible for Solr to apply security by using an Authority Connector ?
>>>>
>>>> Dominique
>>>>
>>>>
>>>
>>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

I just committed the promised changes to the LCF Solr output connector.

ACL metadata will now be posted to the Solr Http interface along with the document as the two following fields:

__ACCESS_TOKEN__document
__DENY_TOKEN__document

There will, of course, potentially be multiple values for each of these two fields.
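
To illustrate how a search-side component might use these two fields, here is a minimal Python sketch that builds a Lucene-syntax filter from a user's tokens. It assumes a simple rule: a document is visible when at least one of the user's tokens appears in __ACCESS_TOKEN__document and none appears in __DENY_TOKEN__document. The function names are hypothetical, and the real LCF rules may differ, for example for documents that carry no ACL at all.

```python
# Field names as posted by the LCF Solr output connector (from the message above).
ALLOW_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def quote_token(token):
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_security_filter(user_tokens):
    """Return a filter string: require an allow match, forbid a deny match.

    Assumption: allow-any / deny-any semantics, which this thread does not
    spell out in detail.
    """
    if not user_tokens:
        # Assumed behaviour: a user with no tokens sees nothing.
        return "-*:*"
    terms = " OR ".join(quote_token(t) for t in user_tokens)
    return f"+{ALLOW_FIELD}:({terms}) -{DENY_FIELD}:({terms})"
```

The resulting string could be passed to Solr as a filter query alongside the user's own query, so that restricted documents are never matched rather than being removed afterwards.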

Hope this helps,
Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 6:51 PM
To: connectors-user@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - just a matter of implementing them in the SOLR-1872 plugin.
I'll need to digest the LCF stuff to get to grips with it... please bear with me while I do that...

When you say:
   The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
Do you mean a mechanism by which solr.war can get url et al info from its parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you need to fetch a URL from a webapp to get what you are looking for.  The "plugs" are all inside LCF for different kinds of repositories.  Here's a link that might help with drinking the LCF "koolaid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The url would be something like this (on a locally installed tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that have been passed into Solr with each document - one set of Allow tokens, and a second set of Deny tokens.  The LCF solr output connection doesn't yet do this, but it is trivial for me to make that happen.
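
As a rough illustration of how a client might consume that response, here is a minimal Python sketch. It assumes the servlet returns UTF-8 plain text with one "TOKEN:" line per access token, as in the sample above, and it simply ignores any other lines. The helper names and the timeout value are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_acl_url(base_url, username):
    """Build the UserACLs URL, percent-encoding the user name."""
    return f"{base_url}/UserACLs?{urlencode({'username': username})}"


def parse_tokens(body):
    """Extract the token values from 'TOKEN:value' lines, skipping anything else."""
    prefix = "TOKEN:"
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


def fetch_tokens(base_url, username, timeout=10.0):
    """Fetch and parse the tokens for one user (assumes a UTF-8 text response)."""
    with urlopen(build_acl_url(base_url, username), timeout=timeout) as response:
        return parse_tokens(response.read().decode("utf-8"))
```

With the local instance mentioned above, the call would look like fetch_tokens("http://localhost:8080/lcf-authority-service", "someusername@somedomain.com").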

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org

Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list containing items that come from various sources at the same time (livelink, documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)
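
Step (3) can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the header name "X-Access-Token" and the field name "allow_token" are hypothetical placeholders (the actual header emitted by mod_authz_annotate and the index field names are not given here), and the request parameters are modelled as a plain dictionary.

```python
def tokens_from_headers(headers, header_name="X-Access-Token"):
    """Collect token values from (name, value) header pairs.

    header_name is a hypothetical placeholder, not the documented header.
    """
    wanted = header_name.lower()
    return [value.strip() for name, value in headers
            if name.lower() == wanted and value.strip()]


def quote(token):
    """Quote a token for use as a phrase in a Lucene-syntax clause."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def restrict_query(params, tokens, allow_field="allow_token"):
    """Return a copy of the search parameters with a security clause added.

    The user's own query ('q') is left untouched; the restriction is added
    as a filter clause so that forbidden documents are never matched.
    """
    restricted = dict(params)
    if tokens:
        clause = f"{allow_field}:(" + " OR ".join(quote(t) for t in tokens) + ")"
    else:
        # Assumed behaviour: no tokens means no visible documents.
        clause = "-*:*"
    restricted["fq"] = list(params.get("fq", [])) + [clause]
    return restricted
```

Restricting the query itself, rather than filtering the results afterwards, means the search engine never has to fetch and then discard documents the user cannot see.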

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whomever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences:

"Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique




Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much
sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - just a matter
of implementing them in the SOLR-1872 plugin.
I'll need to digest the LCF stuff to get to grips with it... please bear with
me while I do that...

When you say:
   The LCF solr output connection doesn't yet do this, but it is trivial for
me to make that happen.
Do you mean a mechanism by which solr.war can get url et al info from its
parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, <ka...@nokia.com> wrote:

>  Hi Peter,
>
> I'm the principal committer for LCF, but I don't know as much about Solr as
> I ought to, so it sounds like a potentially productive collaboration.
>
> LCF does exactly what you are looking for - the only issue at all is that
> you need to fetch a URL from a webapp to get what you are looking for.  The
> "plugs" are all inside LCF for different kinds of repositories.  Here's a
> link that might help with drinking the LCF "koolaid", as it were:
> https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts
>
> The url would be something like this (on a locally installed tomcat-based
> LCF instance):
>
>
> http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com
>
> ... and this fetch returns something like:
>
> TOKEN:xxxxxxx
> TOKEN:yyyyyyy
> TOKEN:zzzzzzz
> ....
>
> ... which represent the amalgamated tokens for all of the defined
> authorities, and by some strange coincidence ( ;-) ) are compatible
> with certain pieces of metadata that have been passed into Solr with each
> document - one set of Allow tokens, and a second set of Deny tokens.  The
> LCF solr output connection doesn't yet do this, but it is trivial for me to
> make that happen.
>
> Does this sound plausible to you?
>
> Karl
>
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Tuesday, April 20, 2010 5:41 PM
> *To:* connectors-user@incubator.apache.org; dev@lucene.apache.org
>
> *Subject:* Re: FW: Solr and LCF security at query time
>
> Hi Karl,
>
> Integrating LCF to get external token support for SOLR-1872 sounds very
> interesting indeed. I don't know anything about LCF, but one of the things I
> was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
> 'pluggable' - i.e. it would just be one of a series of plugins that could be
> used for obtaining back-end authentication information.
>
> If you're good with LCF, perhaps we could work together to build this in.
> One of the first things would be defining an interface that would be as easy
> as possible to plug LCF into. Have you any suggestions/insight on this
> front?
>
> Many thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
>
>>  SOLR-1872 looks exactly like what I was envisioning, from the search
>> query perspective, although instead of the acl xml file you specify, LCF
>> stipulates that you would dynamically query the lcf-authority-service servlet for
>> the access tokens themselves.  That would get you support for AD,
>> Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this
>> component could be modified to work with LCF with minor effort.
>>
>> The missing component still seems to be AD authentication, which needs a
>> solution.
>>
>> Karl
>>
>>  ------------------------------
>> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>> *Sent:* Tuesday, April 20, 2010 10:44 AM
>> *To:* dev@lucene.apache.org
>> *Subject:* Re: FW: Solr and LCF security at query time
>>
>>   If you want to do this completely within Solr, have a look at:
>> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>>
>> Thanks,
>> Peter
>>
>>
>>
>> On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
>>
>>>  FYI
>>>
>>>  ------------------------------
>>> *From:* Wright Karl (Nokia-S/Cambridge)
>>> *Sent:* Tuesday, April 20, 2010 8:16 AM
>>> *To:* 'dominique.bejean@eolya.fr'
>>> *Cc:* 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; '
>>> connectors-user@incubator.apache.org'
>>> *Subject:* RE: Solr and LCF security at query time
>>>
>>>   Dominique,
>>>
>>> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes
>>> a powerful multi-repository security model, even though it doesn't yet do
>>> the final step of enforcing that model at the search end.  LCF allows you to
>>> define multiple authorities to operate against disparate repositories, and
>>> use the appropriate authority to secure any given document.  The solr people
>>> are aware of this design, which addresses the issues raised by SOLR-1834
>>> very nicely.  However, as I said before, time is a problem, and the work
>>> still needs to be done.
>>>
>>> I suggest you read up on the actual security model of LCF, and perhaps
>>> experiment with that and the SOLR-1834 contribution, to see if there is
>>> common ground.  One thing we've learned at MetaCarta is that post-filtering
>>> for security purposes is expensive, and it is better to modify the queries
>>> themselves to restrict the results, if possible.  I'm not sure which
>>> approach SOLR-1834 takes, although it sounds like it might be the filtering
>>> approach.  Still, it would be better than nothing.
>>>
>>> Please let me know what you find out.
>>>
>>> Thanks,
>>> Karl
>>>
>>>  ------------------------------
>>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>>> *Sent:* Tuesday, April 20, 2010 8:03 AM
>>> *To:* Wright Karl (Nokia-S/Cambridge)
>>> *Cc:* connectors-user@incubator.apache.org;
>>> connectors-dev@incubator.apache.org
>>> *Subject:* Re: Solr and LCF security at query time
>>>
>>> Karl,
>>>
>>> Thank you for your reply.
>>>
>>> I did some research today and found this:
>>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>>> http://demo.findwise.se:8880/SolrSecurity/
>>>
>>> The Solr security model has to be able to filter a result list with items
>>> coming from various sources at the same time (LiveLink, Documentum, file
>>> system, ...). Big subject :)
>>>
>>> Dominique
>>>
>>>
>>> On 20/04/10 13:34, karl.wright@nokia.com wrote:
>>>
>>> Hi Dominique,
>>>
>>> At the moment, in order to enforce the LCF security model within
>>> Lucene/Solr, you will need to build this functionality into whatever client
>>> you are using to display the Lucene search results.  Specifically, you would
>>> need to take the following steps:
>>>
>>> (1) Have your users access your search client through Apache.
>>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>>> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
>>> the client webapp.
>>> (3) Have your client webapp alter whatever queries it is doing, to add an
>>> appropriate query clause for each of the access tokens transmitted in the
>>> headers.
>>>
>>> (This is how it is done at MetaCarta.)
>>>
>>> Alternatively, you may find a way to do this completely with a web
>>> application under a Java app server such as Tomcat.  I have not yet done the
>>> research to find out whether this is a feasible alternative.  Effectively,
>>> what you need something like mod_auth_kerb to do is to authenticate your
>>> user against Active Directory, or whomever the authenticator ought to be.
>>> JAAS may be helpful here.
>>>
>>> There are, of course, intentions to fill out the missing pieces more
>>> completely and transparently via a Solr search plugin and/or filter.  What
>>> has been lacking is time.  If you are in a position to do development in
>>> this area, we're happy to have any assistance you might provide.
>>>
>>> Thanks,
>>> Karl
>>>  ------------------------------
>>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>>> *Sent:* Tuesday, April 20, 2010 5:06 AM
>>> *To:* connectors-user@incubator.apache.org
>>> *Subject:* Solr and LCF security at query time
>>>
>>> Hi,
>>>
>>> I don't see in the LCF wiki how Solr and LCF work together at query time in
>>> order to remove from the result list the items the user is not allowed to
>>> access.
>>>
>>> In
>>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>>> I just see these sentences:
>>>
>>> " Once all these documents and their access tokens are handed to the
>>> search engine, it is the search engine's job to enforce security by
>>> excluding inappropriate documents from the search results. For *Lucene*,
>>> this infrastructure is expected to be built on top of Lucene's generic
>>> metadata abilities, but has not been implemented at this time."
>>>
>>> I am not sure I understand. Does this mean that for the moment, it is
>>> not possible for Solr to apply security by using an Authority Connector?
>>>
>>> Dominique
>>>
>>>
>>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for; the only catch is that you need to fetch a URL from a webapp to get the tokens.  The "plugs" for the different kinds of repositories are all inside LCF.  Here's a link that might help with drinking the LCF "Kool-Aid", as it were: https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The URL would be something like this (on a locally installed Tomcat-based LCF instance):

http://localhost:8080/lcf-authority-service/UserACLs?username=someusername@somedomain.com

... and this fetch returns something like:

TOKEN:xxxxxxx
TOKEN:yyyyyyy
TOKEN:zzzzzzz
....

... which represent the amalgamated tokens for all of the defined authorities, and by some strange coincidence ( ;-) ) are compatible with certain pieces of metadata that are meant to be passed into Solr with each document: one set of Allow tokens, and a second set of Deny tokens.  The LCF Solr output connection doesn't yet send that metadata, but it is trivial for me to make that happen.
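
As a rough illustration of the fetch described above, here is a minimal Python sketch that builds the UserACLs URL and parses the response into a list of tokens. It assumes the response is plain UTF-8 text with one TOKEN:-prefixed value per line, as in the sample; the function names are hypothetical and not part of LCF.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the example above; adjust for a real deployment.
AUTHORITY_URL = "http://localhost:8080/lcf-authority-service/UserACLs"


def build_acl_url(username, base_url=AUTHORITY_URL):
    """Return the authority-service URL for one user (username is URL-encoded)."""
    return base_url + "?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract token values from a response with one 'TOKEN:value' per line.

    Lines that do not start with 'TOKEN:' are skipped (assumption: this
    sketch does not try to interpret any other kind of line).
    """
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


def fetch_tokens(username, timeout=10):
    """Fetch and parse the access tokens for a user (performs network I/O)."""
    with urlopen(build_acl_url(username), timeout=timeout) as response:
        # Assumption: the servlet answers in UTF-8 plain text.
        return parse_tokens(response.read().decode("utf-8"))
```

Only fetch_tokens touches the network; build_acl_url and parse_tokens are pure functions, so they can be exercised without a running LCF instance.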

Does this sound plausible to you?

Karl


________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 5:41 PM
To: connectors-user@incubator.apache.org; dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very interesting indeed. I don't know anything about LCF, but one of the things I was planning for SOLR-1872 is to make acl.xml (or rather its behaviour) 'pluggable' - i.e. it would just be one of a series of plugins that could be used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in. One of the first things would be defining an interface that would be as easy as possible to plug LCF into. Have you any suggestions/insight on this front?

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:
SOLR-1872 looks exactly like what I was envisioning, from the search query perspective, although instead of the acl.xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list with items coming from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.

(This is how it is done at MetaCarta.)
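
Step (3) can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the field names allow_token and deny_token are hypothetical, and the rule applied (a document is visible when it carries at least one of the user's tokens as an Allow token and none of them as a Deny token) is one plausible reading of the allow/deny model rather than a confirmed LCF specification. The tokens are taken as an already-parsed list, since the exact header names are not given in this thread.

```python
def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def security_clause(tokens, allow_field="allow_token", deny_field="deny_token"):
    """Build a Lucene-syntax clause restricting results to the user's tokens.

    Field names are hypothetical placeholders for whatever metadata fields
    hold the per-document Allow and Deny tokens.
    """
    if not tokens:
        # No tokens means no access; the caller should refuse the search
        # rather than run an unrestricted query.
        raise ValueError("no access tokens for this user")
    allow = " OR ".join(f"{allow_field}:{quote_term(t)}" for t in tokens)
    deny = " OR ".join(f"{deny_field}:{quote_term(t)}" for t in tokens)
    return f"({allow}) AND NOT ({deny})"


def restrict_query(user_query, tokens):
    """Combine the user's own query with the security clause."""
    return f"({user_query}) AND ({security_clause(tokens)})"
```

Because the restriction is folded into the query itself, the search engine never returns documents the user cannot see, which avoids post-filtering the result list.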

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whatever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences:

" Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique



Re: FW: Solr and LCF security at query time

Posted by Dominique Bejean <do...@eolya.fr>.
I am happy to see that my original question has generated so much activity :)
Keep going, guys!

On 20/04/10 23:40, Peter Sturge wrote:
> Hi Karl,
>
> Integrating LCF to get external token support for SOLR-1872 sounds 
> very interesting indeed. I don't know anything about LCF, but one of 
> the things I was planning for SOLR-1872 is to make acl.xml (or rather 
> its behaviour) 'pluggable' - i.e. it would just be one of a series of 
> plugins that could be used for obtaining back-end authentication 
> information.
>
> If you're good with LCF, perhaps we could work together to build this 
> in. One of the first things would be defining an interface that would 
> be as easy as possible to plug LCF into. Have you any 
> suggestions/insight on this front?
>
> Many thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 4:08 PM, <karl.wright@nokia.com> wrote:
>
>     SOLR-1872 looks exactly like what I was envisioning, from the
>     search query perspective, although instead of the acl.xml file you
>     specify, LCF stipulates that you would dynamically query the
>     lcf-authority-service servlet for the access tokens themselves. 
>     That would get you support for AD, Documentum, LiveLink, Meridio,
>     and Memex for free. It seems likely that this component could be
>     modified to work with LCF with minor effort.
>     The missing component still seems to be AD authentication, which
>     needs a solution.
>     Karl
>
>     ------------------------------------------------------------------------
>     *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
>     *Sent:* Tuesday, April 20, 2010 10:44 AM
>     *To:* dev@lucene.apache.org
>     *Subject:* Re: FW: Solr and LCF security at query time
>
>     If you want to do this completely within Solr, have a look at:
>     SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>
>     Thanks,
>     Peter
>
>
>
>     On Tue, Apr 20, 2010 at 1:25 PM, <karl.wright@nokia.com> wrote:
>
>         FYI
>
>         ------------------------------------------------------------------------
>         *From:* Wright Karl (Nokia-S/Cambridge)
>         *Sent:* Tuesday, April 20, 2010 8:16 AM
>         *To:* 'dominique.bejean@eolya.fr'
>         *Cc:* 'solr-dev@apache.org';
>         'connectors-dev@incubator.apache.org';
>         'connectors-user@incubator.apache.org'
>         *Subject:* RE: Solr and LCF security at query time
>
>         Dominique,
>         Yes, I am aware of this ticket and contribution.  Luckily LCF
>         establishes a powerful multi-repository security model, even
>         though it doesn't yet do the final step of enforcing that
>         model at the search end.  LCF allows you to define multiple
>         authorities to operate against disparate repositories, and use
>         the appropriate authority to secure any given document.  The
>         solr people are aware of this design, which addresses the
>         issues raised by SOLR-1834 very nicely.  However, as I said
>         before, time is a problem, and the work still needs to be done.
>         I suggest you read up on the actual security model of LCF, and
>         perhaps experiment with that and the SOLR-1834 contribution,
>         to see if there is common ground.  One thing we've learned at
>         MetaCarta is that post-filtering for security purposes is
>         expensive, and it is better to modify the queries themselves
>         to restrict the results, if possible.  I'm not sure which
>         approach SOLR-1834 takes, although it sounds like it might be
>         the filtering approach.  Still, it would be better than nothing.
>         Please let me know what you find out.
>         Thanks,
>         Karl
>
>         ------------------------------------------------------------------------
>         *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>         *Sent:* Tuesday, April 20, 2010 8:03 AM
>         *To:* Wright Karl (Nokia-S/Cambridge)
>         *Cc:* connectors-user@incubator.apache.org;
>         connectors-dev@incubator.apache.org
>         *Subject:* Re: Solr and LCF security at query time
>
>         Karl,
>
>         Thank you for your reply.
>
>         I did some research today and found this:
>         http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>         http://demo.findwise.se:8880/SolrSecurity/
>
>         The Solr security model has to be able to filter a result list with
>         items coming from various sources at the same time (LiveLink,
>         Documentum, file system, ...). Big subject :)
>
>         Dominique
>
>
>         On 20/04/10 13:34, karl.wright@nokia.com wrote:
>>         Hi Dominique,
>>         At the moment, in order to enforce the LCF security model
>>         within Lucene/Solr, you will need to build this
>>         functionality into whatever client you are using to display
>>         the Lucene search results.  Specifically, you would need to
>>         take the following steps:
>>         (1) Have your users access your search client through Apache.
>>         (2) Use the Apache module mod_auth_kerb, combined with LCF's
>>         mod_authz_annotate, to cause authorization HTTP headers to be
>>         transmitted to the client webapp.
>>         (3) Have your client webapp alter whatever queries it is
>>         doing, to add an appropriate query clause for each of the
>>         access tokens transmitted in the headers.
>>         (This is how it is done at MetaCarta.)
>>         Alternatively, you may find a way to do this completely with
>>         a web application under a Java app server such as Tomcat.  I
>>         have not yet done the research to find out whether this is a
>>         feasible alternative.  Effectively, what you need something
>>         like mod_auth_kerb to do is to authenticate your user against
>>         Active Directory, or whomever the authenticator ought to be. 
>>         JAAS may be helpful here.
>>         There are, of course, intentions to fill out the missing
>>         pieces more completely and transparently via a Solr search
>>         plugin and/or filter.  What has been lacking is time.  If you
>>         are in a position to do development in this area, we're happy
>>         to have any assistance you might provide.
>>         Thanks,
>>         Karl
>>         ------------------------------------------------------------------------
>>         *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>>         *Sent:* Tuesday, April 20, 2010 5:06 AM
>>         *To:* connectors-user@incubator.apache.org
>>         *Subject:* Solr and LCF security at query time
>>
>>         Hi,
>>
>>         I don't see in the LCF wiki how Solr and LCF work together at
>>         query time in order to remove from the result list the items
>>         the user is not allowed to access.
>>
>>         In
>>         http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>>         I just see these sentences:
>>
>>         " Once all these documents and their access tokens are handed
>>         to the search engine, it is the search engine's job to
>>         enforce security by excluding inappropriate documents from
>>         the search results. For *Lucene*, this infrastructure is
>>         expected to be built on top of Lucene's generic metadata
>>         abilities, but has not been implemented at this time."
>>
>>         I am not sure I understand. Does this mean that for the
>>         moment, it is not possible for Solr to apply security by
>>         using an Authority Connector?
>>
>>         Dominique
>
>
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very
interesting indeed. I don't know anything about LCF, but one of the things I
was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
'pluggable' - i.e. it would just be one of a series of plugins that could be
used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in.
One of the first things would be defining an interface that would be as easy
as possible to plug LCF into. Have you any suggestions/insight on this
front?
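
A minimal sketch of what such a pluggable interface might look like, written in Python for brevity. Everything here is hypothetical: the names TokenSource, StaticAclSource, CallableSource and collect_tokens do not come from SOLR-1872 or LCF, and merging the tokens from several plugins is an assumption about how a "series of plugins" could be combined.

```python
from abc import ABC, abstractmethod


class TokenSource(ABC):
    """Hypothetical plugin interface: maps a user name to access tokens."""

    @abstractmethod
    def tokens_for(self, username):
        """Return the list of access tokens held by `username`."""


class StaticAclSource(TokenSource):
    """Stand-in for a file-based source such as acl.xml (file format not modelled)."""

    def __init__(self, acl):
        # acl: mapping of user name -> iterable of tokens
        self._acl = {user: list(tokens) for user, tokens in acl.items()}

    def tokens_for(self, username):
        return list(self._acl.get(username, []))


class CallableSource(TokenSource):
    """Wraps any function, e.g. one that queries an external authority service."""

    def __init__(self, fetch):
        self._fetch = fetch

    def tokens_for(self, username):
        return list(self._fetch(username))


def collect_tokens(sources, username):
    """Merge tokens from every configured source, keeping order and dropping repeats."""
    seen = []
    for source in sources:
        for token in source.tokens_for(username):
            if token not in seen:
                seen.append(token)
    return seen
```

With an interface this small, an LCF-backed plugin would only need to supply a function that turns a user name into tokens, and the existing file-based behaviour would become just another implementation.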

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:

>  SOLR-1872 looks exactly like what I was envisioning, from the search
> query perspective, although instead of the acl.xml file you specify, LCF
> stipulates that you would dynamically query the lcf-authority-service servlet for
> the access tokens themselves.  That would get you support for AD,
> Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this
> component could be modified to work with LCF with minor effort.
>
> The missing component still seems to be AD authentication, which needs a
> solution.
>
> Karl
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Tuesday, April 20, 2010 10:44 AM
> *To:* dev@lucene.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> If you want to do this completely within Solr, have a look at:
> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>
> Thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
>
>>  FYI
>>
>>  ------------------------------
>> *From:* Wright Karl (Nokia-S/Cambridge)
>> *Sent:* Tuesday, April 20, 2010 8:16 AM
>> *To:* 'dominique.bejean@eolya.fr'
>> *Cc:* 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; '
>> connectors-user@incubator.apache.org'
>> *Subject:* RE: Solr and LCF security at query time
>>
>>   Dominique,
>>
>> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes
>> a powerful multi-repository security model, even though it doesn't yet do
>> the final step of enforcing that model at the search end.  LCF allows you to
>> define multiple authorities to operate against disparate repositories, and
>> use the appropriate authority to secure any given document.  The solr people
>> are aware of this design, which addresses the issues raised by SOLR-1834
>> very nicely.  However, as I said before, time is a problem, and the work
>> still needs to be done.
>>
>> I suggest you read up on the actual security model of LCF, and perhaps
>> experiment with that and the SOLR-1834 contribution, to see if there is
>> common ground.  One thing we've learned at MetaCarta is that post-filtering
>> for security purposes is expensive, and it is better to modify the queries
>> themselves to restrict the results, if possible.  I'm not sure which
>> approach SOLR-1834 takes, although it sounds like it might be the filtering
>> approach.  Still, it would be better than nothing.
>>
>> Please let me know what you find out.
>>
>> Thanks,
>> Karl
>>
>>  ------------------------------
>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> *Sent:* Tuesday, April 20, 2010 8:03 AM
>> *To:* Wright Karl (Nokia-S/Cambridge)
>> *Cc:* connectors-user@incubator.apache.org;
>> connectors-dev@incubator.apache.org
>> *Subject:* Re: Solr and LCF security at query time
>>
>> Karl,
>>
>> Thank you for your reply.
>>
>> I did some research today and found this:
>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> http://demo.findwise.se:8880/SolrSecurity/
>>
>> The Solr security model has to be able to filter a result list with items
>> coming from various sources at the same time (LiveLink, Documentum, file
>> system, ...). Big subject :)
>>
>> Dominique
>>
>>
>> Le 20/04/10 13:34, karl.wright@nokia.com a écrit :
>>
>> Hi Dominique,
>>
>> At the moment, in order to enforce the LCF security model within
>> Lucene/Solr, you will need to build this functionality into whatever client
>> you are using to display the Lucene search results.  Specifically, you would
>> need to take the following steps:
>>
>> (1) Have your users access your search client through Apache.
>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
>> the client webapp.
>> (3) Have your client webapp alter whatever queries it is doing, to add an
>> appropriate query clause for each of the access tokens transmitted in the
>> headers.
>>
>> (This is how it is done at MetaCarta.)
>>
>> Alternatively, you may find a way to do this completely with a web
>> application under a Java app server such as Tomcat.  I have not yet done the
>> research to find out whether this is a feasible alternative.  Effectively,
>> what you need something like mod_auth_kerb to do is to authenticate your
>> user against Active Directory, or whomever the authenticator ought to be.
>> JAAS may be helpful here.
>>
>> There are, of course, intentions to fill out the missing pieces more
>> completely and transparently via a Solr search plugin and/or filter.  What
>> has been lacking is time.  If you are in a position to do development in
>> this area, we're happy to have any assistance you might provide.
>>
>> Thanks,
>> Karl
>>  ------------------------------
>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>>
>> *Sent:* Tuesday, April 20, 2010 5:06 AM
>> *To:* connectors-user@incubator.apache.org
>> *Subject:* Solr and LCF security at query time
>>
>> Hi,
>>
>> I don't see in the LCF wiki how Solr and LCF work together at query time in
>> order to remove from the result list the items the user is not allowed to
>> access.
>>
>> In
>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> I just see these sentences:
>>
>> "Once all these documents and their access tokens are handed to the
>> search engine, it is the search engine's job to enforce security by
>> excluding inappropriate documents from the search results. For *Lucene*,
>> this infrastructure is expected to be built on top of Lucene's generic
>> metadata abilities, but has not been implemented at this time."
>>
>> I am not sure I understand. Does this mean that, for the moment, it is not
>> possible for Solr to apply security by using an Authority Connector?
>>
>> Dominique
>>
>>
>

Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
Hi Karl,

Integrating LCF to get external token support for SOLR-1872 sounds very
interesting indeed. I don't know anything about LCF, but one of the things I
was planning for SOLR-1872 is to make acl.xml (or rather its behaviour)
'pluggable' - i.e. it would just be one of a series of plugins that could be
used for obtaining back-end authentication information.

If you're good with LCF, perhaps we could work together to build this in.
One of the first things would be defining an interface that would be as easy
as possible to plug LCF into. Have you any suggestions/insight on this
front?
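
As a starting point for that discussion, here is a minimal sketch of what such a pluggable contract might look like. Everything in it is hypothetical: the names AccessTokenProvider, StaticAclProvider, tokensFor and grant do not come from SOLR-1872 or LCF. The idea is only that a static acl.xml-style source and an LCF-backed source could sit behind the same small interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plugin contract: resolve a user name to its access tokens. */
interface AccessTokenProvider {
    List<String> tokensFor(String userName);
}

/** Stand-in for a static, acl.xml-style source: tokens come from a fixed map. */
class StaticAclProvider implements AccessTokenProvider {
    private final Map<String, List<String>> acl = new HashMap<>();

    /** Register the tokens a user holds (a defensive, read-only copy is stored). */
    void grant(String userName, List<String> tokens) {
        acl.put(userName, Collections.unmodifiableList(new ArrayList<>(tokens)));
    }

    @Override
    public List<String> tokensFor(String userName) {
        // Unknown users get no tokens, so they match no secured document.
        return acl.getOrDefault(userName, Collections.<String>emptyList());
    }
}
```

An LCF-backed implementation would, under this assumption, implement the same interface and fetch the tokens from the authority service instead of a local map.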

Many thanks,
Peter



On Tue, Apr 20, 2010 at 4:08 PM, <ka...@nokia.com> wrote:

> SOLR-1872 looks exactly like what I was envisioning from the search
> query perspective, although instead of the acl.xml file you specify, LCF
> stipulates that you would dynamically query the lcf-authority-service servlet
> for the access tokens themselves.  That would get you support for AD,
> Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this
> component could be modified to work with LCF with minor effort.
>
> The missing component still seems to be AD authentication, which needs a
> solution.
>
> Karl
>
>  ------------------------------
> *From:* ext Peter Sturge [mailto:peter.sturge@googlemail.com]
> *Sent:* Tuesday, April 20, 2010 10:44 AM
> *To:* dev@lucene.apache.org
> *Subject:* Re: FW: Solr and LCF security at query time
>
> If you want to do this completely within Solr, have a look at:
> SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.
>
> Thanks,
> Peter
>
>
>
> On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
>
>>  FYI
>>
>>  ------------------------------
>> *From:* Wright Karl (Nokia-S/Cambridge)
>> *Sent:* Tuesday, April 20, 2010 8:16 AM
>> *To:* 'dominique.bejean@eolya.fr'
>> *Cc:* 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; '
>> connectors-user@incubator.apache.org'
>> *Subject:* RE: Solr and LCF security at query time
>>
>>   Dominique,
>>
>> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes
>> a powerful multi-repository security model, even though it doesn't yet do
>> the final step of enforcing that model at the search end.  LCF allows you to
>> define multiple authorities to operate against disparate repositories, and
>> use the appropriate authority to secure any given document.  The solr people
>> are aware of this design, which addresses the issues raised by SOLR-1834
>> very nicely.  However, as I said before, time is a problem, and the work
>> still needs to be done.
>>
>> I suggest you read up on the actual security model of LCF, and perhaps
>> experiment with that and the SOLR-1834 contribution, to see if there is
>> common ground.  One thing we've learned at MetaCarta is that post-filtering
>> for security purposes is expensive, and it is better to modify the queries
>> themselves to restrict the results, if possible.  I'm not sure which
>> approach SOLR-1834 takes, although it sounds like it might be the filtering
>> approach.  Still, it would be better than nothing.
>>
>> Please let me know what you find out.
>>
>> Thanks,
>> Karl
>>
>>  ------------------------------
>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>> *Sent:* Tuesday, April 20, 2010 8:03 AM
>> *To:* Wright Karl (Nokia-S/Cambridge)
>> *Cc:* connectors-user@incubator.apache.org;
>> connectors-dev@incubator.apache.org
>> *Subject:* Re: Solr and LCF security at query time
>>
>> Karl,
>>
>> Thank you for your reply.
>>
>> I did some research today and found this:
>> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
>> http://demo.findwise.se:8880/SolrSecurity/
>>
>> The Solr security model has to be able to filter a result list with items
>> coming from various sources at the same time (LiveLink, Documentum, file
>> system, ...). Big subject :)
>>
>> Dominique
>>
>>
>> Le 20/04/10 13:34, karl.wright@nokia.com a écrit :
>>
>> Hi Dominique,
>>
>> At the moment, in order to enforce the LCF security model within
>> Lucene/Solr, you will need to build this functionality into whatever client
>> you are using to display the Lucene search results.  Specifically, you would
>> need to take the following steps:
>>
>> (1) Have your users access your search client through Apache.
>> (2) Use the Apache module mod_auth_kerb, combined with LCF's
>> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
>> the client webapp.
>> (3) Have your client webapp alter whatever queries it is doing, to add an
>> appropriate query clause for each of the access tokens transmitted in the
>> headers.
>>
>> (This is how it is done at MetaCarta.)
>>
>> Alternatively, you may find a way to do this completely with a web
>> application under a Java app server such as Tomcat.  I have not yet done the
>> research to find out whether this is a feasible alternative.  Effectively,
>> what you need something like mod_auth_kerb to do is to authenticate your
>> user against Active Directory, or whomever the authenticator ought to be.
>> JAAS may be helpful here.
>>
>> There are, of course, intentions to fill out the missing pieces more
>> completely and transparently via a Solr search plugin and/or filter.  What
>> has been lacking is time.  If you are in a position to do development in
>> this area, we're happy to have any assistance you might provide.
>>
>> Thanks,
>> Karl
>>  ------------------------------
>> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>>
>> *Sent:* Tuesday, April 20, 2010 5:06 AM
>> *To:* connectors-user@incubator.apache.org
>> *Subject:* Solr and LCF security at query time
>>
>> Hi,
>>
>> I don't see in the LCF wiki how Solr and LCF work together at query time in
>> order to remove from the result list the items the user is not allowed to
>> access.
>>
>> In
>> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
>> I just see these sentences:
>>
>> "Once all these documents and their access tokens are handed to the
>> search engine, it is the search engine's job to enforce security by
>> excluding inappropriate documents from the search results. For *Lucene*,
>> this infrastructure is expected to be built on top of Lucene's generic
>> metadata abilities, but has not been implemented at this time."
>>
>> I am not sure I understand. Does this mean that, for the moment, it is not
>> possible for Solr to apply security by using an Authority Connector?
>>
>> Dominique
>>
>>
>

RE: FW: Solr and LCF security at query time

Posted by ka...@nokia.com.
SOLR-1872 looks exactly like what I was envisioning from the search query perspective, although instead of the acl.xml file you specify, LCF stipulates that you would dynamically query the lcf-authority-service servlet for the access tokens themselves.  That would get you support for AD, Documentum, LiveLink, Meridio, and Memex for free. It seems likely that this component could be modified to work with LCF with minor effort.

The missing component still seems to be AD authentication, which needs a solution.

Karl

________________________________
From: ext Peter Sturge [mailto:peter.sturge@googlemail.com]
Sent: Tuesday, April 20, 2010 10:44 AM
To: dev@lucene.apache.org
Subject: Re: FW: Solr and LCF security at query time

If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:
FYI

________________________________
From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 20, 2010 8:16 AM
To: 'dominique.bejean@eolya.fr'
Cc: 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; 'connectors-user@incubator.apache.org'
Subject: RE: Solr and LCF security at query time

Dominique,

Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a powerful multi-repository security model, even though it doesn't yet do the final step of enforcing that model at the search end.  LCF allows you to define multiple authorities to operate against disparate repositories, and use the appropriate authority to secure any given document.  The solr people are aware of this design, which addresses the issues raised by SOLR-1834 very nicely.  However, as I said before, time is a problem, and the work still needs to be done.

I suggest you read up on the actual security model of LCF, and perhaps experiment with that and the SOLR-1834 contribution, to see if there is common ground.  One thing we've learned at MetaCarta is that post-filtering for security purposes is expensive, and it is better to modify the queries themselves to restrict the results, if possible.  I'm not sure which approach SOLR-1834 takes, although it sounds like it might be the filtering approach.  Still, it would be better than nothing.
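
To make the cost argument above concrete, here is a toy sketch (not MetaCarta's or SOLR-1834's code) of why post-filtering can be expensive. It assumes a simplified model in which each ranked hit needs exactly one access token; the class and method names are invented for illustration. When only a small fraction of hits is visible to the user, many hits must be examined to fill a single page.

```java
import java.util.List;
import java.util.Set;

class PostFilterCost {
    /** A ranked hit carrying the single token needed to view it (toy model). */
    static final class Hit {
        final String id;
        final String token;

        Hit(String id, String token) {
            this.id = id;
            this.token = token;
        }
    }

    /**
     * Walk the ranked hits in order, keeping only those the user may see,
     * until one page is full. Returns how many hits had to be examined.
     */
    static int examinedToFillPage(List<Hit> ranked, Set<String> userTokens, int pageSize) {
        int examined = 0;
        int kept = 0;
        for (Hit hit : ranked) {
            if (kept == pageSize) {
                break; // page is full, stop scanning
            }
            examined++;
            if (userTokens.contains(hit.token)) {
                kept++;
            }
        }
        return examined;
    }
}
```

If only every tenth hit is visible to the user, filling a page of three results means examining thirty hits. When the restriction is part of the query itself, the engine only ever returns permitted documents, so no hits are fetched and then discarded.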

Please let me know what you find out.

Thanks,
Karl

________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 8:03 AM
To: Wright Karl (Nokia-S/Cambridge)
Cc: connectors-user@incubator.apache.org; connectors-dev@incubator.apache.org
Subject: Re: Solr and LCF security at query time

Karl,

Thank you for your reply.

I did some research today and found this:
http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
http://demo.findwise.se:8880/SolrSecurity/

The Solr security model has to be able to filter a result list with items coming from various sources at the same time (LiveLink, Documentum, file system, ...). Big subject :)

Dominique


On 20/04/10 13:34, karl.wright@nokia.com wrote:
Hi Dominique,

At the moment, in order to enforce the LCF security model within Lucene/Solr, you will need to build this functionality into whatever client you are using to display the Lucene search results.  Specifically, you would need to take the following steps:

(1) Have your users access your search client through Apache.
(2) Use the Apache module mod_auth_kerb, combined with LCF's mod_authz_annotate, to cause authorization HTTP headers to be transmitted to the client webapp.
(3) Have your client webapp alter whatever queries it is doing, to add an appropriate query clause for each of the access tokens transmitted in the headers.
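
Step (3) can be sketched roughly as follows. This is an illustration under stated assumptions, not the MetaCarta implementation: the field name allow_token is hypothetical, the tokens are assumed to have already been read from the request headers, and the result uses ordinary Lucene query-parser syntax. A user with no tokens gets a clause that excludes every document.

```java
import java.util.List;
import java.util.StringJoiner;

class TokenFilterBuilder {
    /** Escape backslashes and double quotes so a token is safe inside a quoted term. */
    static String quote(String token) {
        return "\"" + token.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Build a clause that requires at least one of the user's tokens in the field. */
    static String filterClause(String field, List<String> tokens) {
        if (tokens.isEmpty()) {
            // No tokens at all: exclude every document (assumed fail-closed policy).
            return "-*:*";
        }
        StringJoiner any = new StringJoiner(" OR ", "+(", ")");
        for (String token : tokens) {
            any.add(field + ":" + quote(token));
        }
        return any.toString();
    }

    /** Combine the user's own query with the security clause. */
    static String restrict(String userQuery, String field, List<String> tokens) {
        return "+(" + userQuery + ") " + filterClause(field, tokens);
    }
}
```

For example, restrict("title:solr", "allow_token", tokens) with the tokens A and B yields +(title:solr) +(allow_token:"A" OR allow_token:"B"). In Solr, the security clause could instead be sent as a separate fq parameter so that it does not influence scoring.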

(This is how it is done at MetaCarta.)

Alternatively, you may find a way to do this completely with a web application under a Java app server such as Tomcat.  I have not yet done the research to find out whether this is a feasible alternative.  Effectively, what you need something like mod_auth_kerb to do is to authenticate your user against Active Directory, or whomever the authenticator ought to be.  JAAS may be helpful here.

There are, of course, intentions to fill out the missing pieces more completely and transparently via a Solr search plugin and/or filter.  What has been lacking is time.  If you are in a position to do development in this area, we're happy to have any assistance you might provide.

Thanks,
Karl
________________________________
From: ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
Sent: Tuesday, April 20, 2010 5:06 AM
To: connectors-user@incubator.apache.org
Subject: Solr and LCF security at query time

Hi,

I don't see in the LCF wiki how Solr and LCF work together at query time in order to remove from the result list the items the user is not allowed to access.

In http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html, I just see these sentences:

"Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results. For Lucene, this infrastructure is expected to be built on top of Lucene's generic metadata abilities, but has not been implemented at this time."

I am not sure I understand. Does this mean that, for the moment, it is not possible for Solr to apply security by using an Authority Connector?

Dominique


Re: FW: Solr and LCF security at query time

Posted by Peter Sturge <pe...@googlemail.com>.
If you want to do this completely within Solr, have a look at:
SOLR-1834 and SOLR-1872. These use a SearchComponent plugin for Solr.

Thanks,
Peter



On Tue, Apr 20, 2010 at 1:25 PM, <ka...@nokia.com> wrote:

>  FYI
>
>  ------------------------------
> *From:* Wright Karl (Nokia-S/Cambridge)
> *Sent:* Tuesday, April 20, 2010 8:16 AM
> *To:* 'dominique.bejean@eolya.fr'
> *Cc:* 'solr-dev@apache.org'; 'connectors-dev@incubator.apache.org'; '
> connectors-user@incubator.apache.org'
> *Subject:* RE: Solr and LCF security at query time
>
>  Dominique,
>
> Yes, I am aware of this ticket and contribution.  Luckily LCF establishes a
> powerful multi-repository security model, even though it doesn't yet do the
> final step of enforcing that model at the search end.  LCF allows you to
> define multiple authorities to operate against disparate repositories, and
> use the appropriate authority to secure any given document.  The solr people
> are aware of this design, which addresses the issues raised by SOLR-1834
> very nicely.  However, as I said before, time is a problem, and the work
> still needs to be done.
>
> I suggest you read up on the actual security model of LCF, and perhaps
> experiment with that and the SOLR-1834 contribution, to see if there is
> common ground.  One thing we've learned at MetaCarta is that post-filtering
> for security purposes is expensive, and it is better to modify the queries
> themselves to restrict the results, if possible.  I'm not sure which
> approach SOLR-1834 takes, although it sounds like it might be the filtering
> approach.  Still, it would be better than nothing.
>
> Please let me know what you find out.
>
> Thanks,
> Karl
>
>  ------------------------------
> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> *Sent:* Tuesday, April 20, 2010 8:03 AM
> *To:* Wright Karl (Nokia-S/Cambridge)
> *Cc:* connectors-user@incubator.apache.org;
> connectors-dev@incubator.apache.org
> *Subject:* Re: Solr and LCF security at query time
>
> Karl,
>
> Thank you for your reply.
>
> I did some research today and found this:
> http://freesurf001.appspot.com/issues.apache.org/jira/browse/SOLR-1834
> http://demo.findwise.se:8880/SolrSecurity/
>
> The Solr security model has to be able to filter a result list with items
> coming from various sources at the same time (LiveLink, Documentum, file
> system, ...). Big subject :)
>
> Dominique
>
>
> Le 20/04/10 13:34, karl.wright@nokia.com a écrit :
>
> Hi Dominique,
>
> At the moment, in order to enforce the LCF security model within
> Lucene/Solr, you will need to build this functionality into whatever client
> you are using to display the Lucene search results.  Specifically, you would
> need to take the following steps:
>
> (1) Have your users access your search client through Apache.
> (2) Use the Apache module mod_auth_kerb, combined with LCF's
> mod_authz_annotate, to cause authorization HTTP headers to be transmitted to
> the client webapp.
> (3) Have your client webapp alter whatever queries it is doing, to add an
> appropriate query clause for each of the access tokens transmitted in the
> headers.
>
> (This is how it is done at MetaCarta.)
>
> Alternatively, you may find a way to do this completely with a web
> application under a Java app server such as Tomcat.  I have not yet done the
> research to find out whether this is a feasible alternative.  Effectively,
> what you need something like mod_auth_kerb to do is to authenticate your
> user against Active Directory, or whomever the authenticator ought to be.
> JAAS may be helpful here.
>
> There are, of course, intentions to fill out the missing pieces more
> completely and transparently via a Solr search plugin and/or filter.  What
> has been lacking is time.  If you are in a position to do development in
> this area, we're happy to have any assistance you might provide.
>
> Thanks,
> Karl
>  ------------------------------
> *From:* ext Dominique Bejean [mailto:dominique.bejean@eolya.fr]
>
> *Sent:* Tuesday, April 20, 2010 5:06 AM
> *To:* connectors-user@incubator.apache.org
> *Subject:* Solr and LCF security at query time
>
> Hi,
>
> I don't see in the LCF wiki how Solr and LCF work together at query time in
> order to remove from the result list the items the user is not allowed to
> access.
>
> In
> http://cwiki.apache.org/CONNECTORS/lucene-connectors-framework-concepts.html,
> I just see these sentences:
>
> "Once all these documents and their access tokens are handed to the search
> engine, it is the search engine's job to enforce security by excluding
> inappropriate documents from the search results. For *Lucene*, this
> infrastructure is expected to be built on top of Lucene's generic metadata
> abilities, but has not been implemented at this time."
>
> I am not sure I understand. Does this mean that, for the moment, it is not
> possible for Solr to apply security by using an Authority Connector?
>
> Dominique
>
>