Posted to dev@drill.apache.org by Charles Givre <cg...@gmail.com> on 2022/01/13 20:29:12 UTC

[DISCUSS] Per User Access Controls

Hello all, 
One of the issues we've been dancing around is per-user access controls in Drill.  As Drill was originally built around the Hadoop ecosystem, the Hadoop-based connections use user impersonation for per-user access controls.  However, a rather glaring deficiency is the lack of per-user access controls for connections like JDBC, Mongo, Splunk, etc.

Recently, while I was working on the OAuth pull request, it occurred to me that we might be able to slightly extend the credential provider interface to allow for per-user credentials.  Here's what I was thinking...

A bit of background: the credential provider interface is really an abstraction for a HashMap.  Here's my proposal: the credential provider interface would store two HashMaps, one for per-user creds and one for global creds.  When an authenticated user creates a storage plugin connection, the credential provider would associate the creds with their Drill username.  Storage plugins that use the credential provider would thus get per-user credentials.
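The two-map idea might look something like the following sketch. All class and method names here are hypothetical illustrations, not Drill's actual CredentialsProvider API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the proposed two-map credentials provider.
class TwoLevelCredentialsProvider {
  // Global credentials shared by all users (today's behavior).
  private final Map<String, String> globalCredentials = new HashMap<>();
  // Per-user credentials keyed by Drill username.
  private final Map<String, Map<String, String>> userCredentials = new HashMap<>();

  public void setGlobalCredential(String key, String value) {
    globalCredentials.put(key, value);
  }

  public void setUserCredential(String username, String key, String value) {
    userCredentials.computeIfAbsent(username, u -> new HashMap<>()).put(key, value);
  }

  // Per-user credentials take precedence; fall back to the global set.
  public Optional<String> getCredential(String username, String key) {
    Map<String, String> userMap = userCredentials.get(username);
    if (userMap != null && userMap.containsKey(key)) {
      return Optional.of(userMap.get(key));
    }
    return Optional.ofNullable(globalCredentials.get(key));
  }
}
```

In this sketch a user who has stored their own creds gets those back, while everyone else falls through to the global creds, which preserves today's behavior as the default.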

If users did not want per-user credentials, they could simply use direct credentials, or specify that in the credential provider classes.  What do you think?

Best,
-- C


Re: [DISCUSS] Per User Access Controls

Posted by James Turton <ja...@somecomputer.xyz.INVALID>.
Footnote: once #2422 is merged there will be precedent for the enhanced kind of storage plugin mentioned in the message below, one that establishes connections per Drill user. The Phoenix plugin will do this when impersonation is switched on. Note, though, that it makes no use of credential providers, being entirely based on Kerberos.

On 14 January 2022 11:51:09 GMT+02:00, James Turton <dz...@apache.org> wrote:
> [quoted message snipped; the full text appears below]

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: [DISCUSS] Per User Access Controls

Posted by James Turton <dz...@apache.org>.
Thanks to everyone who shared the thoughtful observations about security 
architecture.  This email is just to add some specific feedback on the 
credential provider proposal below.

Only the in-memory PlainCredentialsProvider wraps a Map of credentials.  
The other implementations only ever construct such Maps on the fly from 
their backing store upon receiving a call to getCredentials().  
Nevertheless, every provider could in principle be augmented with an 
additional in-memory store in the form of a new member Map, this being 
the second such Map in the case of PlainCredentialsProvider.

Now the hope is that the proposed additional credentials Map will bring 
support for user-scoped credentials to the credential providers.  Let's 
work through an example to see what happens. Imagine a Drill environment 
with two users, Alice being an admin who can create storage configs.  
Alice logs in and creates a storage config called "postgresql".  She 
must capture persistent credentials for "postgresql" in one of the 
following supported places: inline in the JSON, in env vars on the 
server, in the Hadoop conf.xml on the server, or in HashiCorp Vault.  
Drill doesn't write to any of those places on its own so she has to 
write to the relevant store directly herself.

Crucially, only one set of credentials for "postgresql" can be captured 
in any of the listed persistent stores.  It would not help if the creds 
provider impl, which currently does not participate in storage config 
creation at all, could also record Alice's credentials to a new in-memory 
Map which remembers that they belong to Alice. When the Drillbit is 
restarted, the single set of persistent credentials for "postgresql" 
will be read back in, leaving Bob with no place to persist his own 
"postgresql" credentials.

Even if we imagine a creds provider impl that correctly persists and 
returns credentials specific to the active Drill user, storage plugins 
themselves would need to change.  Instead of obtaining credentials and 
establishing outbound connections during their initialisation, they would 
need to re-obtain credentials on every new query and check whether they 
already have a connection for those credentials, or need to establish a 
new one.

These things aren't impossible but the changes run deeper than adding a 
Map and a little logic to the creds providers.
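To make the last point concrete, the query-time lookup could resemble the following rough sketch: one live connection per distinct credential set, resolved on every query rather than at plugin initialisation. All names here (CredKey, the connector function) are hypothetical, not Drill's storage plugin API:

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch: cache one outbound connection per distinct
// (username, secret) pair, created lazily at query time.
class PerUserConnectionCache<C> {
  private final Map<CredKey, C> connections = new ConcurrentHashMap<>();
  private final Function<CredKey, C> connector;

  public PerUserConnectionCache(Function<CredKey, C> connector) {
    this.connector = connector;
  }

  // Called on every new query, not during plugin initialisation.
  public C connectionFor(String username, String secret) {
    return connections.computeIfAbsent(new CredKey(username, secret), connector);
  }

  public int openConnections() {
    return connections.size();
  }

  static final class CredKey {
    final String username;
    final String secret;

    CredKey(String username, String secret) {
      this.username = username;
      this.secret = secret;
    }

    @Override public boolean equals(Object o) {
      if (!(o instanceof CredKey)) return false;
      CredKey k = (CredKey) o;
      return username.equals(k.username) && secret.equals(k.secret);
    }

    @Override public int hashCode() {
      return Objects.hash(username, secret);
    }
  }
}
```

Repeated queries by the same user with the same credentials reuse the cached connection; a different user, or a rotated credential, triggers a new one.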


On 2022/01/13 22:29, Charles Givre wrote:
> [quoted message snipped; the full text appears at the top of the thread]


Re: Re: [DISCUSS] Per User Access Controls

Posted by Ted Dunning <te...@gmail.com>.
GRANT and REVOKE implicitly assume that the database is the king of access
control. That works when the database owns the data.

In the modern world, where data storage is separated from query, it is truly
painful to have to manage permissions for each analysis and each query tool,
and nearly impossible to keep them synchronized. Likewise, it is impossible
to get plugins for systems like Ranger for all possible tools, and
impossible for Ranger to even understand all tools.

For instance, suppose you have S3 data, files, and a database, each with
permissions already defined. Now you have users who want to use Drill (for
SQL processing), Jupyter notebooks with Python for data engineering, Julia
with Pluto notebooks for numerical work, and batchwise Spark jobs, all for
building data pipelines across all the kinds of data. None of Python, Julia,
or Spark can really be protected by Ranger. All assume that file permissions
or S3 IAM policies do that job.



On Thu, Jan 13, 2022 at 10:49 PM Z0ltrix <z0...@pm.me.invalid> wrote:

> [quoted thread snipped; the full messages appear below]

AW: Re: [DISCUSS] Per User Access Controls

Posted by Z0ltrix <z0...@pm.me.INVALID>.
Hi @All,

As someone who uses Drill with a kerberized Hadoop cluster and Ranger as the central access-control system, I would love to have a Ranger plugin for Drill, but I would assume a lot of Drill users just spin up a cluster in front of S3 or Azure.

So why not use a generic approach with GRANT and REVOKE for users and groups on specific workspaces, or at least storage plugins?

With that, an admin can control which users and groups can access each of the storage plugins we have, no matter whether the underlying plugin has such a system.


Maybe we could use the Metastore to store such information?
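As a rough illustration of what such Metastore-backed grants could look like, here is a sketch of a grant table with user and group principals. The structure and names are purely hypothetical, not an existing Drill or Metastore schema:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical GRANT/REVOKE table as it might be persisted in the Metastore:
// each storage plugin (or workspace) maps to the principals granted access.
class PluginGrants {
  private final Map<String, Set<String>> grants = new HashMap<>();

  public void grant(String plugin, String principal) {
    grants.computeIfAbsent(plugin, p -> new HashSet<>()).add(principal);
  }

  public void revoke(String plugin, String principal) {
    Set<String> s = grants.get(plugin);
    if (s != null) {
      s.remove(principal);
    }
  }

  // A user may access the plugin if they, or one of their groups, hold a grant.
  public boolean canAccess(String plugin, String user, Set<String> groups) {
    Set<String> s = grants.getOrDefault(plugin, Set.of());
    if (s.contains(user)) {
      return true;
    }
    for (String g : groups) {
      if (s.contains(g)) {
        return true;
      }
    }
    return false;
  }
}
```

An admin would issue GRANT/REVOKE statements that update this table, and Drill would consult it before planning a query against the plugin, independent of whatever access controls the underlying system has.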

Regards,
Christian

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

Paul Rogers <pa...@gmail.com> wrote on Thursday, 13 January 2022 at 23:40:

> [quoted message snipped; the full text appears below]

Re: [DISCUSS] Per User Access Controls

Posted by Paul Rogers <pa...@gmail.com>.
Hey All,

Other members of the Hadoop Ecosystem rely on external systems to handle
permissions: Ranger or Sentry. There is probably something different in the
AWS world.

As you look into security, you'll see that you need to maintain permissions
on many entities: files, connections, etc. You need different permissions:
read, write, create, etc. In larger groups of people, you need roles: admin
role, sales analyst role, production engineer role. Users map to roles, and
roles take permissions.

Creating this just for Drill is not effective: no one wants to learn a
Drill "Security Store" any more than folks want to learn the "Drill
metastore". Drill is seldom the only tool in a shop: people want to set
permissions in one place, not in each tool. So, we should integrate with
existing tools.

Drill should provide an API, and be prepared to enforce rules. Drill
defines the entities that can be secured, and the available permissions.
Then, it is up to an external system to provide user identity, take tuples
of (user, resource, permission), and return a boolean indicating whether that
user is authorized. MapR, PAM, Hadoop and other systems would be
implemented on top of the Drill permissions API, as would whatever else you
happen to need.
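The shape of that API might be sketched as follows. All names are illustrative; Drill defines no such interface today, and a real backend would delegate to Ranger, Sentry, or similar:

```java
// Sketch of a pluggable authorization API of the kind described above.
class PermissionsApiSketch {
  // Drill would define the securable entities and the available permissions...
  enum Resource { FILE, CONNECTION, WORKSPACE }
  enum Permission { READ, WRITE, CREATE }

  // ...and an external system answers the (user, resource, permission) query.
  interface Authorizer {
    boolean isAuthorized(String user, Resource resource, Permission permission);
  }

  // Trivial single-rule implementation standing in for a real backend.
  static Authorizer allowOnly(String user, Resource r, Permission p) {
    return (u, res, perm) -> u.equals(user) && res == r && perm == p;
  }
}
```

Drill itself would only call `isAuthorized` at the enforcement points; identity, roles, and policy storage would all live behind the interface in the external system.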

Thanks,

- Paul

On Thu, Jan 13, 2022 at 12:32 PM Curtis Lambert <cu...@datadistillr.com>
wrote:

> [quoted message snipped; the full text appears below]

Re: [DISCUSS] Per User Access Controls

Posted by Curtis Lambert <cu...@datadistillr.com>.
This is what we are handling with Vault outside of Drill, combined with
aliasing. James is tracking some of what you've been finding with the
credential store but even then we want the single source of auth. We can
chat with James on the next Drill stand up (and anyone else who wants to
feel the pain).



Curtis Lambert
CTO
Email: curtis@datadistillr.com
Phone: 706-402-0249
LinkedIn: https://www.linkedin.com/in/curtis-lambert-2009b2141/
Calendly: https://calendly.com/curtis283/generic-zoom
https://www.datadistillr.com/


On Thu, Jan 13, 2022 at 3:29 PM Charles Givre <cg...@gmail.com> wrote:

> [quoted message snipped; the full text appears at the top of the thread]