Posted to commits@druid.apache.org by "paul-rogers (via GitHub)" <gi...@apache.org> on 2023/02/23 00:16:39 UTC

[GitHub] [druid] paul-rogers opened a new issue, #13837: Input source security model for MSQ table functions and more

paul-rogers opened a new issue, #13837:
URL: https://github.com/apache/druid/issues/13837

   MSQ provides a powerful way to ingest data into Druid using SQL, placing ingestion in the hands of more users. Prior to MSQ, only those users with sufficient knowledge to use a batch ingestion spec could ingest data. With the wider audience comes the need to add protections that prevent users from accessing data that doesn't make sense in a particular Druid deployment. This proposal outlines an extension to the Druid security model to address this issue. As it turns out, the Druid catalog feature has similar needs, which we also address.
   
   ## Input Source Security in Druid Today
   
   Let us start by reviewing the current security model. Druid's model has two parts:
   
   * A "resource action" (class `ResourceAction`), which is essentially a triple of (category, name, read/write).
   * An "authorizer mapper" (class `AuthorizerMapper`), which provides a yes/no answer to the question, "Does the user have access to this set of resource actions?" (A simplified sketch of this model follows this list.)
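
   For illustration, here is a minimal, self-contained sketch of that two-part model. These simplified types are stand-ins used in the sketches below, not the actual Druid classes, whose signatures differ:

   ```java
   // Simplified stand-ins for the two parts of the model described above.
   // These are illustrative classes, not the actual Druid types.
   import java.util.Set;

   enum Action { READ, WRITE }

   // A resource action: essentially the triple (category, name, read/write).
   record ResourceAction(String category, String name, Action action) {}

   // The authorization side answers yes/no for a set of required actions.
   interface Authorizer {
     boolean isAllowed(String user, Set<ResourceAction> requiredActions);
   }

   // A trivial authorizer that allows access only when every required action
   // has been explicitly granted (grants would be keyed by user in practice).
   class AllowListAuthorizer implements Authorizer {
     private final Set<ResourceAction> grants;

     AllowListAuthorizer(Set<ResourceAction> grants) {
       this.grants = grants;
     }

     @Override
     public boolean isAllowed(String user, Set<ResourceAction> requiredActions) {
       return grants.containsAll(requiredActions);
     }
   }
   ```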
   
   In MSQ today, there is a single resource action: `(EXTERNAL, EXTERNAL, READ)`, which was introduced along with the `extern` table function. That single permission allows the user to read data from S3, from the local disk, from HTTP, and so on.
   
   The key gap in this model is the lack of fine-grained control over which input sources to allow. (A site may want to allow access to S3, but not to the local file system on each ingest node.)
   
   ## Proposed Input Source Security Model
   
   The proposed model extends the current system by adding more values for the second (name) element of the resource action. Specifically, we use the JSON type name of each input source, so that access to the HTTP input source would be checked as `(EXTERNAL, http, READ)`, and similarly for all other Druid input sources.
   
   MSQ currently provides multiple ways to access a given input source: via the `extern` function, or via the newer input-source-specific functions such as `http` or `localfiles`. By applying security at the level of the input source, the same security rules apply regardless of the function used to access the input source.
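
   Using the simplified types from the sketch above, the core of the idea is that the required action is derived from the input source's JSON type name, so `extern` with an `http` input source and the `http` table function resolve to the same check:

   ```java
   class InputSourceSecurity {
     // Build the fine-grained action from the input source's JSON type name,
     // e.g. "http", "s3", or "local", regardless of which table function is used.
     static ResourceAction actionFor(String inputSourceTypeName) {
       return new ResourceAction("EXTERNAL", inputSourceTypeName, Action.READ);
     }

     public static void main(String[] args) {
       // An extern() call with an http input source and the http() table
       // function both require the same permission: (EXTERNAL, http, READ).
       System.out.println(actionFor("http").equals(actionFor("http"))); // true
     }
   }
   ```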
   
   ## Security Model for External Tables Defined in the Catalog
   
   The Druid catalog provides the ability to define an external table: a metadata entry that can be used in MSQ queries in lieu of spelling out the input source details in each query. The catalog already uses security rules that mimic Druid datasources. To access an external table, the user needs permission on `(ext, <table name>, READ)`, where `ext` is a new Druid schema introduced to hold external table definitions.
   
   We propose to extend the security model to _also_ require permission on the underlying input source type. Thus, if `myS3` is an external table that accesses S3, then the user needs both `(ext, myS3, READ)` and `(EXTERNAL, s3, READ)` permissions.
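
   Continuing the sketch, the check for a catalog-defined external table combines both requirements (the table and input-source names are just the example from the text):

   ```java
   import java.util.Set;

   class ExternalTableSecurity {
     // A catalog external table requires the table-level permission plus the
     // permission on the underlying input source type.
     static Set<ResourceAction> requiredActions(String tableName, String inputSourceTypeName) {
       return Set.of(
           new ResourceAction("ext", tableName, Action.READ),
           new ResourceAction("EXTERNAL", inputSourceTypeName, Action.READ)
       );
     }
   }

   // Example: requiredActions("myS3", "s3") yields
   // (ext, myS3, READ) and (EXTERNAL, s3, READ).
   ```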
   
   ## Security Model for Custom MSQ Table Functions
   
   A great strength of Druid is the ability to add new input sources and "user-defined" SQL functions. These can be combined to provide new MSQ table functions for new input sources. The simplest such function is a wrapper. Suppose I want to ingest data from [IPFS](https://en.wikipedia.org/wiki/InterPlanetary_File_System). I would first write the input source, then provide an `ipfs` table function, which would check the required `(EXTERNAL, ipfs, READ)` permission. (Note that the "ipfs" in the resource action is the name of the input source, not of the table function.)
   
   In another scenario, I might write a specialized table function to access an application's staging area. That staging area is on S3, but the user need not know that. (We might later move it to GCP.) In this case, the function would create a "virtual" input source, say, "abc-staging", and permissions would be granted on that name: `(EXTERNAL, abc-staging, READ)`. The extension could use S3 internally, but that's an implementation detail.
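
   A sketch of that "virtual" input source idea (the `abc-staging` function and helper names are hypothetical; only the shape of the permission comes from the proposal):

   ```java
   class AbcStagingTableFunction {
     // The function authorizes against a virtual input source name; users are
     // granted (EXTERNAL, abc-staging, READ) without knowing the backing store.
     static ResourceAction requiredAction() {
       return new ResourceAction("EXTERNAL", "abc-staging", Action.READ);
     }

     // Internally the function might build an S3 (or, later, GCP) input source.
     // That backend choice stays an implementation detail.
     static String backingInputSourceType() {
       return "s3";
     }
   }
   ```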
   
   ## Code Revisions
   
   Because this is a security change, we must ensure that _all_ of Druid honors the new rules, not just MSQ. An outline of the changes:
   
   * Modify each existing SQL table function to use `(EXTERNAL, <input source>, READ)` instead of `(EXTERNAL, EXTERNAL, READ)`. Note: this could be a breaking change for existing users.
   * Extend the catalog code to add the input source security check in addition to the external table check, as noted above.
   * Extend native batch ingest specs that use input sources to apply the new security checks.
   * Determine how to apply the rules to native batch ingestion that uses firehose factories.
   * Determine how to apply the rules to Hadoop ingest specs that use Hadoop FS paths.
   * Determine how to apply the rules to the various flavors of realtime ingest.
   
   ## Documentation and Release Notes
   
   Documentation must explain the new permissions and security models. (Start with the text above.)
   
   Release notes should announce when this model becomes available. The note must clearly state that if customers currently grant permissions on the existing MSQ `(EXTERNAL, EXTERNAL, READ)` resource action, those rules should instead grant access on `(EXTERNAL, *, READ)` to avoid ingestion failures.
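
   In terms of the simplified model above, the wildcard grant preserves today's behavior because it covers every input-source-specific action. The pattern handling below is illustrative; real authorizer extensions have their own grant formats:

   ```java
   class WildcardGrant {
     private final String category;
     private final String namePattern; // "*" covers every input source type
     private final Action action;

     WildcardGrant(String category, String namePattern, Action action) {
       this.category = category;
       this.namePattern = namePattern;
       this.action = action;
     }

     // (EXTERNAL, *, READ) covers (EXTERNAL, http, READ), (EXTERNAL, s3, READ), etc.
     boolean covers(ResourceAction required) {
       return category.equals(required.category())
           && action == required.action()
           && ("*".equals(namePattern) || namePattern.equals(required.name()));
     }
   }
   ```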


[GitHub] [druid] zachjsh commented on issue #13837: Input source security model for MSQ table functions and more

Posted by "zachjsh (via GitHub)" <gi...@apache.org>.
zachjsh commented on issue #13837:
URL: https://github.com/apache/druid/issues/13837#issuecomment-1442555517

   thanks for the writeup, @paul-rogers. This seems like a great idea!
   
   One question / concern I had: how would this handle the case where users were previously relying on the resource action `(EXTERNAL, EXTERNAL, READ)` to read from any inputSource type? During upgrade, we could maybe migrate such a permission for those users / roles to grant permissions on all input source types configured at that time, just as an example. Wondering if you considered this and what ideas you may have around it?


[GitHub] [druid] gianm commented on issue #13837: Input source security model for MSQ table functions and more

Posted by "gianm (via GitHub)" <gi...@apache.org>.
gianm commented on issue #13837:
URL: https://github.com/apache/druid/issues/13837#issuecomment-1447182330

   With regard to backwards-compatibility with `(EXTERNAL, EXTERNAL, READ)` stuff, the feature flag approach sounds good to me. The documentation on this page would need to be updated as well: https://druid.apache.org/docs/latest/multi-stage-query/security.html.
   
   Some notes on the other stuff we'll need to sort out as part of this:
   
   ### non-http protocols via `http` input source
   
   The `http` input source is implemented using [java.net.URLConnection](https://docs.oracle.com/javase/8/docs/api/java/net/URLConnection.html), which can handle various protocols other than http (including local `file://`). Currently the config `druid.ingestion.http.allowedProtocols` (default: `http, https`) is used to control which protocols are permitted via this input source.
   
   We should consider how this all fits together. Perhaps something like this:
   
   - `(EXTERNAL, http, READ)` refers to the `http` input source.
   - The `http` input source may, if `druid.ingestion.http.allowedProtocols` is set, handle non-http protocols. This isn't the concern of the authorization layer (see the sketch after this list).
   - To ensure that people who use either of the above features (`EXTERNAL` authorization, or `druid.ingestion.http.allowedProtocols`) understand their interaction, we should include notes about this in the docs for both features (with examples).
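
   A sketch of how those two layers would stack for the `http` input source, using the simplified types sketched in the issue description (the protocol check below only models `druid.ingestion.http.allowedProtocols`; it is not the actual implementation):

   ```java
   import java.net.URI;
   import java.util.Set;

   class HttpInputSourceChecks {
     // Layer 1: authorization cares only about the input source type.
     static ResourceAction requiredAction() {
       return new ResourceAction("EXTERNAL", "http", Action.READ);
     }

     // Layer 2: the input source validates the URI scheme against the configured
     // allow-list (modeled on druid.ingestion.http.allowedProtocols, default
     // "http, https"); this check is independent of the authorization layer.
     static void validateProtocol(URI uri, Set<String> allowedProtocols) {
       String scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase();
       if (!allowedProtocols.contains(scheme)) {
         throw new IllegalArgumentException("Protocol [" + scheme + "] is not allowed");
       }
     }
   }
   ```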
   
   ### non-hdfs protocols via `hdfs` input source
   
   The `hdfs` input source has a similar behavior to the `http` input source. Like `http`, it supports various non-hdfs protocols. Like `http`, there is a `druid.ingestion.hdfs.allowedProtocols` that controls which protocols are allowed. Like `http`, the default set is limited to only the obvious one (`hdfs`).
   
   So, we should be able to take the same approach here that we take with `http`.
   
   ### firehose factories
   
   [Firehoses](https://druid.apache.org/docs/latest/ingestion/native-batch-firehose.html) are a deprecated predecessor to the current "input source" concept. They have been deprecated since 0.17 (late 2019). If we're going with a feature flag for the overall input-source-security feature, IMO it makes sense for that feature flag to also disable firehose factories completely. This absolves us of the responsibility to figure out how to fit them into the new security framework.
   
   ### Hadoop ingest
   
   Hadoop ingest doesn't use our input source concept: instead, it uses Hadoop filesystems and path globs. One approach that comes to mind here is to special-case it to piggyback on the native `hdfs` input source. The idea being:
   
   - If a user has `(EXTERNAL, hdfs, READ)` permissions then they can submit Hadoop ingest jobs.
   - If a user does _not_ have that permission, then they _cannot_ submit Hadoop ingest jobs.
   
   It would be excellent, in addition, to introduce a permission (or cluster-wide setting) specifically for whether it is possible to submit Hadoop jobs at all. People that do not use the Hadoop integration would appreciate the opportunity to switch it off completely, thereby minimizing their potential attack surface.
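
   Sketching that combination with the simplified types from the issue description (the cluster-wide switch is hypothetical; the `(EXTERNAL, hdfs, READ)` piggyback is the idea above):

   ```java
   import java.util.Set;

   class HadoopIngestSecurity {
     // Hypothetical cluster-wide switch for sites that never run Hadoop ingest.
     private final boolean hadoopIngestEnabled;

     HadoopIngestSecurity(boolean hadoopIngestEnabled) {
       this.hadoopIngestEnabled = hadoopIngestEnabled;
     }

     // Hadoop specs have no input source, so piggyback on the hdfs permission.
     Set<ResourceAction> requiredActions() {
       return Set.of(new ResourceAction("EXTERNAL", "hdfs", Action.READ));
     }

     // Submission is possible only when Hadoop ingest is enabled at all; the
     // per-user permission check then happens on top of this.
     boolean submissionAllowedAtAll() {
       return hadoopIngestEnabled;
     }
   }
   ```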
   
   ### Realtime ingest
   
   Realtime ingest doesn't use our input source concept: instead, it uses Kafka and Kinesis supervisors with system-specific `ioConfig` APIs.
   
   One approach that comes to mind is something similar to the proposal for Hadoop above: special-case these to use `(EXTERNAL, kafka, READ)` and `(EXTERNAL, kinesis, READ)` respectively. This doesn't make quite as much sense as the Hadoop case, since while there _is_ an `hdfs` input source, there are no `kafka` and `kinesis` input sources.
   
   I'm open to other ideas.


[GitHub] [druid] zachjsh closed issue #13837: Input source security model for MSQ table functions and more

Posted by "zachjsh (via GitHub)" <gi...@apache.org>.
zachjsh closed issue #13837: Input source security model for MSQ table functions and more
URL: https://github.com/apache/druid/issues/13837


[GitHub] [druid] paul-rogers commented on issue #13837: Input source security model for MSQ table functions and more

Posted by "paul-rogers (via GitHub)" <gi...@apache.org>.
paul-rogers commented on issue #13837:
URL: https://github.com/apache/druid/issues/13837#issuecomment-1444760032

   @zachjsh, thanks for the comment. Yes, that is a hole that's been worrying me. Security is handled via extensions. If those extensions are set up to handle all `EXTERNAL` resources the same, then this change is backward-compatible. But, if any one system explicitly handles `(EXTERNAL, EXTERNAL, READ)`, then we'll break things, which is not ideal.
   
   One possible solution is to add a feature flag to enable "enhanced" input source security. The trick will be to wire that up to the right spot in Calcite, since properties are provided via Guice and Calcite doesn't play the Guice game. I'll work this out when I tinker with the code.
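
   As a sketch of what that flag could look like at the point where the permission is computed (the flag and its plumbing are hypothetical; wiring it into the Calcite layer is the open question above):

   ```java
   class InputSourcePermissions {
     // Hypothetical flag, e.g. bound from a runtime property and injected via Guice.
     private final boolean enhancedInputSourceSecurity;

     InputSourcePermissions(boolean enhancedInputSourceSecurity) {
       this.enhancedInputSourceSecurity = enhancedInputSourceSecurity;
     }

     // With the flag off, keep the legacy coarse-grained permission so existing
     // (EXTERNAL, EXTERNAL, READ) grants keep working unchanged.
     ResourceAction requiredAction(String inputSourceTypeName) {
       return enhancedInputSourceSecurity
           ? new ResourceAction("EXTERNAL", inputSourceTypeName, Action.READ)
           : new ResourceAction("EXTERNAL", "EXTERNAL", Action.READ);
     }
   }
   ```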

