Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/11/16 20:28:00 UTC

[GitHub] [beam] creste opened a new issue, #24210: [Feature Request]: Teach Azure Filesystem to authenticate using DefaultAzureCredential in the Python SDK

creste opened a new issue, #24210:
URL: https://github.com/apache/beam/issues/24210

   ### What would you like to happen?
   
   # Problem
   
   Currently, the Azure Filesystem for the Python SDK only supports authenticating using the [`AZURE_STORAGE_CONNECTION_STRING`](https://github.com/apache/beam/blob/b952b41788acc20edbe5b75b2196f30dbf8fdeb0/sdks/python/apache_beam/io/azure/blobstorageio.py#L109) environment variable.  That approach has several limitations:
   - The `AZURE_STORAGE_CONNECTION_STRING` environment variable must be defined on all systems where the pipeline executes.  This is difficult to configure when using Beam worker-pool sidecar containers with the FlinkRunner because Flink may be running in session mode with different Beam pipelines needing different connection strings.
   - The call to [`BlobServiceClient.from_connection_string()`](https://github.com/apache/beam/blob/b952b41788acc20edbe5b75b2196f30dbf8fdeb0/sdks/python/apache_beam/io/azure/blobstorageio.py#L111) does not support all of the authentication methods supported by [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential).  For my use case in particular, it does not support [Managed Identity](https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview) credentials.
   
   # Solution
   
   I plan to address the above limitations in a PR by adding new Azure-specific pipeline options described below.
   
   ## `--azure_blob_storage_connection_string`
   Specifies the [Azure Storage Connection String](https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string).
   
   Can be used instead of the `AZURE_STORAGE_CONNECTION_STRING` environment variable or the new `--azure_blob_storage_account_url` pipeline option described below.
   
   Example:
   ```bash
   python -m apache_beam.examples.wordcount \
     --input azfs://devstoreaccount1/container/* \
     --output azfs://devstoreaccount1/container/py-wordcount-integration \
     --azure_blob_storage_connection_string "DefaultEndpointsProtocol=https;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=https://azurite:10000/devstoreaccount1;"
   ```
   ## `--azure_blob_storage_account_url`
   Specifies the [Azure Blob Storage Account Endpoint URL](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview#standard-endpoints).
   
   Can be used instead of the `AZURE_STORAGE_CONNECTION_STRING` environment variable or the new `--azure_blob_storage_connection_string` pipeline option described above.
   
   This pipeline option uses [`DefaultAzureCredential()`](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#authenticate-with-defaultazurecredential) to authenticate.
   
   Example:
   ```bash
   python -m apache_beam.examples.wordcount \
     --input azfs://devstoreaccount1/container/* \
     --output azfs://devstoreaccount1/container/py-wordcount-integration \
     --azure_blob_storage_account_url https://devstoreaccount1.blob.core.windows.net/
   ```
   
   ## `--azure_managed_identity_client_id`
   Specifies the Managed Identity Client ID.  Can only be used with `--azure_blob_storage_account_url`.
   
   This pipeline option uses [`DefaultAzureCredential(managed_identity_client_id=client_id)`](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#specify-a-user-assigned-managed-identity-for-defaultazurecredential) to authenticate.
   
   Example:
   ```bash
   python -m apache_beam.examples.wordcount \
     --input azfs://devstoreaccount1/container/* \
     --output azfs://devstoreaccount1/container/py-wordcount-integration \
     --azure_blob_storage_account_url https://devstoreaccount1.blob.core.windows.net/ \
     --azure_managed_identity_client_id ca6cc1a3-4b82-48bd-97ca-8e799c0abff6
   ```
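The way the three proposed options could map onto client construction can be sketched as follows. This is only an illustration, not the code from the planned PR: the helper names (`parse_azure_options`, `choose_auth`, `build_client`) are hypothetical, the precedence order (account URL first, then the connection string option, then the environment variable) is an assumption, and plain `argparse` stands in for Beam's pipeline options machinery. The option names are the ones proposed in this issue.

```python
import argparse
import os

ENV_VAR = "AZURE_STORAGE_CONNECTION_STRING"


def parse_azure_options(argv):
    """Parse the three proposed options; other pipeline arguments are ignored."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--azure_blob_storage_connection_string", default=None)
    parser.add_argument("--azure_blob_storage_account_url", default=None)
    parser.add_argument("--azure_managed_identity_client_id", default=None)
    opts, _unknown = parser.parse_known_args(argv)
    return opts


def choose_auth(opts, environ=None):
    """Return (method, kwargs) describing how the client would be built.

    Assumed precedence (not stated in the issue): account URL, then the
    connection string option, then the environment variable.
    """
    environ = os.environ if environ is None else environ
    account_url = opts.azure_blob_storage_account_url
    client_id = opts.azure_managed_identity_client_id

    # The issue states that the client ID is only valid with the account URL.
    if client_id and not account_url:
        raise ValueError(
            "--azure_managed_identity_client_id requires "
            "--azure_blob_storage_account_url")
    if account_url:
        return "default_credential", {
            "account_url": account_url,
            "managed_identity_client_id": client_id,
        }
    conn_str = opts.azure_blob_storage_connection_string or environ.get(ENV_VAR)
    if conn_str:
        return "connection_string", {"conn_str": conn_str}
    raise ValueError("no Azure Blob Storage credentials configured")


def build_client(method, kwargs):
    """Create a BlobServiceClient; requires azure-storage-blob and azure-identity."""
    # Imported lazily so the selection logic above runs without the Azure SDK.
    from azure.storage.blob import BlobServiceClient
    if method == "connection_string":
        return BlobServiceClient.from_connection_string(kwargs["conn_str"])
    from azure.identity import DefaultAzureCredential
    cred_kwargs = {}
    if kwargs.get("managed_identity_client_id"):
        cred_kwargs["managed_identity_client_id"] = kwargs["managed_identity_client_id"]
    return BlobServiceClient(
        account_url=kwargs["account_url"],
        credential=DefaultAzureCredential(**cred_kwargs))
```

Keeping the selection step separate from `build_client` means the option handling can be unit tested without Azure or Azurite being available.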
   # Testing
   Per https://github.com/apache/beam/issues/20511, the Azure Filesystem does not have integration tests against Azure or [Azurite](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio).  I plan to add integration tests for the new pipeline options to run against Azurite, similar to how [HDFS does its integration tests](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/io/hdfs_integration_test).
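One small piece such an Azurite-based test setup would likely need is a connection string that points at the emulator. The sketch below is an assumption about how that could look, not part of the planned tests: the helper name `azurite_connection_string` is hypothetical, the defaults assume Azurite's standard blob port 10000 over plain HTTP on localhost, and the account name and key are the emulator's well-known development credentials, the same ones used in the connection string example above.

```python
# Well-known development account that the Azurite emulator accepts by default.
AZURITE_ACCOUNT = "devstoreaccount1"
AZURITE_KEY = (
    "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/"
    "K1SZFPTOtr/KBHBeksoGMGw==")


def azurite_connection_string(host="127.0.0.1", port=10000, protocol="http"):
    """Build a connection string for an Azurite blob endpoint (hypothetical helper)."""
    endpoint = f"{protocol}://{host}:{port}/{AZURITE_ACCOUNT}"
    return (
        f"DefaultEndpointsProtocol={protocol};"
        f"AccountName={AZURITE_ACCOUNT};"
        f"AccountKey={AZURITE_KEY};"
        f"BlobEndpoint={endpoint};")
```

Calling `azurite_connection_string("azurite", 10000, "https")` reproduces the value passed to `--azure_blob_storage_connection_string` in the example above.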
   
   
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: io-py-ideas


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] creste commented on issue #24210: [Feature Request]: Teach Azure Filesystem to authenticate using DefaultAzureCredential in the Python SDK

Posted by GitBox <gi...@apache.org>.
creste commented on issue #24210:
URL: https://github.com/apache/beam/issues/24210#issuecomment-1317820530

   Thank you for pointing out those flags, @Abacn !  I renamed the flags to `--azure_connection_string` and `--blob_service_endpoint`.
   
   I can try changing `--azure_managed_identity_client_id` to do something similar to `--azureCredentialsProvider`. However, I don't have a good way to test [all of those authentication methods](https://github.com/apache/beam/blob/b952b41788acc20edbe5b75b2196f30dbf8fdeb0/sdks/java/io/azure/src/main/java/org/apache/beam/sdk/io/azure/options/AzureModule.java#L110-L146) using Azurite or Azure.  How do you think I should proceed?


[GitHub] [beam] Abacn commented on issue #24210: [Feature Request]: Teach Azure Filesystem to authenticate using DefaultAzureCredential in the Python SDK

Posted by GitBox <gi...@apache.org>.
Abacn commented on issue #24210:
URL: https://github.com/apache/beam/issues/24210#issuecomment-1317718677

   Thanks @creste for raising this! It is an important issue. Our Java SDK has `--azureConnectionString` and `--blobServiceEndpoint` options, along with some others that are available in Java but not in Python. Ideally we would like similar naming and specifications across the SDKs. Could we use `--azure_connection_string` and `--blob_service_endpoint`? As for `--azure_managed_identity_client_id`, I have not found a matching option in our Java SDK, but there may be one.


[GitHub] [beam] creste commented on issue #24210: [Feature Request]: Teach Azure Filesystem to authenticate using DefaultAzureCredential in the Python SDK

Posted by GitBox <gi...@apache.org>.
creste commented on issue #24210:
URL: https://github.com/apache/beam/issues/24210#issuecomment-1317643262

   .take-issue


[GitHub] [beam] Abacn commented on issue #24210: [Feature Request]: Teach Azure Filesystem to authenticate using DefaultAzureCredential in the Python SDK

Posted by GitBox <gi...@apache.org>.
Abacn commented on issue #24210:
URL: https://github.com/apache/beam/issues/24210#issuecomment-1317827930

   > Thank you for pointing out those flags, @Abacn ! I renamed the flags to `--azure_connection_string` and `--blob_service_endpoint`.
   > 
   > I can try changing `--azure_managed_identity_client_id` to do something similar to `--azureCredentialsProvider`. However, I don't have a good way to test [all of those authentication methods](https://github.com/apache/beam/blob/b952b41788acc20edbe5b75b2196f30dbf8fdeb0/sdks/java/io/azure/src/main/java/org/apache/beam/sdk/io/azure/options/AzureModule.java#L110-L146) using Azurite or Azure. How do you think I should proceed?
   
   That's fine. We can support only managed identity client for now and document it.


[GitHub] [beam] Abacn closed issue #24210: [Feature Request]: Teach Azure Filesystem to authenticate using DefaultAzureCredential in the Python SDK

Posted by GitBox <gi...@apache.org>.
Abacn closed issue #24210: [Feature Request]: Teach Azure Filesystem to authenticate using DefaultAzureCredential in the Python SDK
URL: https://github.com/apache/beam/issues/24210

