Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/03 19:32:40 UTC

[GitHub] [druid] maytasm3 opened a new issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion

maytasm3 opened a new issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
URL: https://github.com/apache/druid/issues/9305
 
 
   ### Description
   Cloud input sources (such as S3) should be able to accept credentials optionally, so that tasks can read from multiple buckets/locations. This makes it possible to use credentials other than the default (i.e. the ones provided in the common runtime properties) and to use different credentials for each ingestion.
   
   ### Motivation
   The source of raw data (input data) can be stored in a bucket/location that has different credentials than the deep storage location. For example, we might want to consume from S3 and also use S3 as deep storage, but the raw-data bucket and the deep storage bucket are different buckets with different credentials. 
   We may also want to ingest from multiple buckets/locations. For example, we might want to consume from multiple S3 buckets, each with its own credentials. We should not have to change `druid.s3.accesskey` for every ingestion just to read raw data from a different bucket. 
   
   ### Proposed changes
   Add two new fields to the inputSource spec: `overrideAccessKeyId` and `overrideSecretAccessKey`.
   
   These fields are optional. 
   
   ```
   ...
       "ioConfig": {
         "type": "index_parallel",
         "inputSource": {
           "type": "s3",
           "prefixes": ["s3://foo/bar", "s3://bar/foo"],
           "overrideAccessKeyId": "123455",
           "overrideSecretAccessKey": "masdjlaksjdlakjsd"
         },
         "inputFormat": {
           "type": "json"
         },
         ...
       },
   ...
   ```
   If neither field is present, the normal cloud client/configuration will be used (current behavior). If one field is present but not the other, the task will fail. If both fields are given, a new client will be constructed for this ingestion task. The new client will use the given access key and secret key along with the rest of the current configuration (i.e. region, etc.). 
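   The both-or-neither rule above could be enforced with a small check at spec-deserialization time. A minimal sketch (class and field names are hypothetical, not actual Druid code):

   ```java
   // Hypothetical sketch of the "both fields or neither" validation described
   // above. Names are illustrative only, not real Druid classes.
   class OverrideCredentials {
       private final String accessKeyId;      // null when not supplied in the spec
       private final String secretAccessKey;  // null when not supplied in the spec

       OverrideCredentials(String accessKeyId, String secretAccessKey) {
           // Supplying exactly one of the two keys is a spec error: fail the task early.
           if ((accessKeyId == null) != (secretAccessKey == null)) {
               throw new IllegalArgumentException(
                   "overrideAccessKeyId and overrideSecretAccessKey must be provided together");
           }
           this.accessKeyId = accessKeyId;
           this.secretAccessKey = secretAccessKey;
       }

       /** True when a dedicated client should be built for this task. */
       boolean hasOverride() {
           return accessKeyId != null && secretAccessKey != null;
       }
   }
   ```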
   
   For example, for an inputSource of type s3 with overrideAccessKeyId and overrideSecretAccessKey given, we would create a new S3 client with something similar to the following:
   
   ```
   // Credentials come from the new fields in the ingestion spec; the rest of
   // the configuration (region, etc.) is taken from the existing settings.
   BasicAWSCredentials awsCreds = new BasicAWSCredentials("access_key_id", "secret_key_id");
   AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                           .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
                           .build();
   ```
   This new client will only be used for reading the input data. Deep storage will still be handled with the default client/configuration. 
   
   The fields overrideAccessKeyId and overrideSecretAccessKey can also be encrypted. Encryption/decryption can be done using a pre-set environment configuration. Users will have to encrypt the values before putting them in the ingestion spec; Druid will use the pre-set environment configuration to decrypt them on ingestion. Hence, the fields in the ingestionSpec will contain the encrypted keys. (We can also use the PasswordProvider for this.)
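   One way the encrypted-field idea could work is sketched below, with the key material taken from a pre-set environment variable. The variable name, cipher choice, and class name are assumptions, not current Druid behavior; ECB mode is used only to keep the sketch short, and a real implementation should use an authenticated mode such as AES/GCM and proper key derivation.

   ```java
   import javax.crypto.Cipher;
   import javax.crypto.spec.SecretKeySpec;
   import java.nio.charset.StandardCharsets;
   import java.security.GeneralSecurityException;
   import java.util.Base64;

   // Illustrative sketch only: encrypt/decrypt a spec field with an AES key
   // whose raw bytes come from a pre-set environment variable.
   class SpecFieldCrypto {
       private static final String ENV_KEY = "DRUID_SPEC_DECRYPTION_KEY"; // hypothetical name

       // Reads the raw 16-byte key value from the environment; a real
       // implementation should derive the key properly (e.g. via PBKDF2).
       static SecretKeySpec keyFromEnv() {
           String raw = System.getenv(ENV_KEY);
           return new SecretKeySpec(raw.getBytes(StandardCharsets.UTF_8), "AES");
       }

       static String encrypt(String plainText, SecretKeySpec key) {
           try {
               Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
               cipher.init(Cipher.ENCRYPT_MODE, key);
               byte[] cipherText = cipher.doFinal(plainText.getBytes(StandardCharsets.UTF_8));
               return Base64.getEncoder().encodeToString(cipherText);
           } catch (GeneralSecurityException e) {
               throw new IllegalStateException(e);
           }
       }

       static String decrypt(String base64CipherText, SecretKeySpec key) {
           try {
               Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
               cipher.init(Cipher.DECRYPT_MODE, key);
               byte[] plain = cipher.doFinal(Base64.getDecoder().decode(base64CipherText));
               return new String(plain, StandardCharsets.UTF_8);
           } catch (GeneralSecurityException e) {
               throw new IllegalStateException(e);
           }
       }
   }
   ```

   With this shape, the ingestionSpec carries only `encrypt(...)` output, and the peon calls `decrypt(...)` before building the client.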
   
   ### Rationale
   Makes ingestion from cloud input sources more flexible to use.
   
   ### Operational impact
   - A new cloud client is created for every ingestion that uses these new fields.
   - No backward-compatibility concerns, since the new fields are optional. 
   
   ### Test plan
   - New integration tests will be added.
   - Will test in a real Druid cluster.
   
   ### Future work
   - The "Multi" input source, to support ingestion with multiple different cloud credentials (such as multiple S3 buckets, or a mix of S3 and GCS buckets, all with different credentials) in the same ingestionSpec. 
   - An extra optional field in the ingestionSpec that takes a path to a file containing the credentials, instead of using the fields overrideAccessKeyId and overrideSecretAccessKey. 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson edited a comment on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
URL: https://github.com/apache/druid/issues/9305#issuecomment-583880150
 
 
   Hi @maytasm3, thank you for the nice proposal! This feature will be super useful. I have some comments on the new fields.
   
   - I think we should use the `passwordProvider` to give a consistent experience to Druid users.
   - I think the new fields are pretty straightforward and the prefix "override" is not necessary. It's also confusing when you read from S3 but use a storage type other than S3 as deep storage.
   - I think we may need more input-source-specific configurations such as `endpoint.url`, `endpoint.signingRegion`, and so on in the future. What do you think about adding a new field containing "accessKeyId" and "secretAccessKey" fields as below?
   
   ```json
         "inputSource": {
           "type": "s3",
           "prefixes": ["s3://foo/bar", "s3://bar/foo"],
           "properties": {
             "accessKeyId": "123455",
             "secretAccessKey": "masdjlaksjdlakjsd"
           }
         }
   ```


[GitHub] [druid] jihoonson commented on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
URL: https://github.com/apache/druid/issues/9305#issuecomment-584782984
 
 
   @maytasm3 thanks, sounds good 👍 





[GitHub] [druid] maytasm3 edited a comment on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion

Posted by GitBox <gi...@apache.org>.
maytasm3 edited a comment on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
URL: https://github.com/apache/druid/issues/9305#issuecomment-584492229
 
 
   @jihoonson 
   Sorry for the confusion around passwordProvider. I think we can reuse the existing password providers for now (no need to implement new ones). We would treat the accessKeyId as one password provider and the secretAccessKey as another, similar to AWSCredentialsConfig (where separate password providers are defined for the .accessKey and the .secretKey). 
   For example, if the user decides to use plain text:
   ```
         "inputSource": {
           "type": "s3",
           "prefixes": ["s3://foo/bar", "s3://bar/foo"],
           "properties": {
             "accessKeyId": {
                 "type": "default", 
                 "password": "abcdef"
              },
             "secretAccessKey": {
                 "type": "default", 
                 "password": "asdasdadsasd"
               }
           }
         }
   ```
   Or they can use Environment variable password provider:
   ```
         "inputSource": {
           "type": "s3",
           "prefixes": ["s3://foo/bar", "s3://bar/foo"],
           "properties": {
             "accessKeyId": {
                 "type": "environment", 
                 "variable": "<ENV.KEY.NAME.CONTAINING.ACCESS.KEY>"
              },
             "secretAccessKey": {
                 "type": "environment", 
                 "variable": "<ENV.KEY.NAME.CONTAINING.SECRET.ACCESS.KEY>"
               }
           }
         }
   ```
   Or they can have a mix (which will be even more beneficial once we have a file-path password provider and an encrypted password provider):
   ```
         "inputSource": {
           "type": "s3",
           "prefixes": ["s3://foo/bar", "s3://bar/foo"],
           "properties": {
          "accessKeyId": {
              "type": "default", 
              "password": "asdasdasdasd"
           },
             "secretAccessKey": {
                 "type": "environment", 
                 "variable": "<ENV.KEY.NAME.CONTAINING.SECRET.ACCESS.KEY>"
               }
           }
         }
   ```
   I think it's better to separate the access key and secret key into separate password providers, so users have more flexibility in choosing which password provider to use (if any at all). 
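   The two provider styles in the JSON above could resolve to plaintext roughly like this. These are simplified stand-ins for the real Druid password-provider classes, shown only to illustrate the shape of the idea:

   ```java
   // Simplified stand-ins for Druid's PasswordProvider concept, showing how the
   // two "type" values could resolve to plaintext values at task time.
   interface PasswordProvider {
       String getPassword();
   }

   // "type": "default" -- the value is stored inline in the spec.
   class DefaultPasswordProvider implements PasswordProvider {
       private final String password;
       DefaultPasswordProvider(String password) { this.password = password; }
       public String getPassword() { return password; }
   }

   // "type": "environment" -- the spec names an environment variable to read at runtime.
   class EnvironmentVariablePasswordProvider implements PasswordProvider {
       private final String variable;
       EnvironmentVariablePasswordProvider(String variable) { this.variable = variable; }
       public String getPassword() { return System.getenv(variable); }
   }
   ```

   Because both the accessKeyId and the secretAccessKey are just `PasswordProvider` instances, each can independently be inline, environment-backed, or any future provider type.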


[GitHub] [druid] maytasm3 commented on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion

Posted by GitBox <gi...@apache.org>.
maytasm3 commented on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
URL: https://github.com/apache/druid/issues/9305#issuecomment-583189773
 
 
   @zachjsh 
   - The client instantiation is done inside the peon process, which is forked from the middleManager. It might be difficult to keep the old client around and reuse it. We can do that as a further optimization if needed. Are you worried about the cost of initializing the client? 
   - We have PasswordProvider, which abstracts out the underlying mechanism for retrieving the password. There is additional work to implement a file-path password and an env-variable key for encrypting/decrypting passwords.       
   - This solution does not support multiple cloud storage providers at one time. Supporting multiple cloud storage providers in a single ingestion can be done using the "Multi" input source idea. Out of scope for this proposal. 



[GitHub] [druid] maytasm3 closed issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion

Posted by GitBox <gi...@apache.org>.
maytasm3 closed issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
URL: https://github.com/apache/druid/issues/9305
 
 
   


[GitHub] [druid] maytasm3 commented on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion

Posted by GitBox <gi...@apache.org>.
maytasm3 commented on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
URL: https://github.com/apache/druid/issues/9305#issuecomment-584411444
 
 
   @jihoonson Thanks for the feedback. 
   
   - If I understand correctly, passwordProvider is an interface that can be backed by different ways to (optionally securely) access secrets. An existing implementation I am planning to use (for the first version of this feature) is EnvironmentVariablePasswordProvider. This retrieves the password using the environment key supplied in the ingestion spec. We can use this for now since it is already implemented and is already used for passing passwords in other configs. Other ideas that could implement passwordProvider to provide alternative ways of passing passwords are: 
   1.  The ingestionSpec could take a file path. The file would contain the access key and secret key, and could live on the machine or in the cloud (accessed with the default cloud credentials). 
   2.  The ingestionSpec could take an encrypted access key and secret key. The decryption key would be an environment variable set beforehand. Users would have to encrypt the access key and secret key themselves before putting them in the ingestionSpec. 
   (These can be implemented later if we find them useful, and could also be used in other places that require passing passwords through a config/API/insecure channel.) 
   - Good point, I agree. We can omit the "override" prefix. Since these new keys are part of the "inputSource" section, I do not think people will confuse them with other usages of S3. Either way, the "override" is not needed.
   - That's a good idea. I agree that in the future it might be necessary to support other S3/cloud properties to make ingestion more flexible (and since we already support taking optional credentials, there is no architectural reason not to support other properties). 
   


[GitHub] [druid] zachjsh edited a comment on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion

Posted by GitBox <gi...@apache.org>.
zachjsh edited a comment on issue #9305: Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
URL: https://github.com/apache/druid/issues/9305#issuecomment-581638464
 
 
   Thanks for proposing this @maytasm3, I think this would be a great addition. Just to confirm: if an ingestion spec is loaded with the same accessKey and secretAccessKey as a previous ingestion spec, would we keep the old client around and reuse it, or would we create a new client on the fly? It sounds like we already have a mechanism in place for encrypting fields in the ingestion spec, is that right? 
   
   Also, would this solution support multiple cloud storage providers at one time?
