Posted to commits@airflow.apache.org by "gabor-one (via GitHub)" <gi...@apache.org> on 2024/04/15 12:53:07 UTC
[I] Airflow produces an unnecessary ' ' (space) in the middle of the WASB URL when WASB connection is read from Azure Key Vault secret backed. [airflow]
gabor-one opened a new issue, #39028:
URL: https://github.com/apache/airflow/issues/39028
### Apache Airflow Provider(s)
microsoft-azure
### Versions of Apache Airflow Providers
apache-airflow-providers-microsoft-azure==9.0.1
### Apache Airflow version
2.9.0
### Operating System
Debian GNU/Linux 12 (bookworm)
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
- Platform: Kubernetes (AKS)
- Executor: KubernetesExecutor
- Using Azure Key-Vault as the secret provider via Workload Identity.
- Using azure_remote_logging (Azure Blob Storage)
### What happened
When the connection is defined in Azure Key Vault, the task pods cannot write logs to Azure Blob Storage at the end of execution. A stray ' ' (space character) appears in the storage account URL (see the last line of the log).
The Airflow WASB connection is defined in Key Vault as follows: `wasb://https://<STORAGE_ACCOUNT_NAME>.blob.core.windows.net`
If the connection is created via the UI and 'remote_log_conn_id' is changed to use that connection for logging, everything works fine.
Logs:
```
[2024-04-15, 12:03:07 UTC] {retries.py:91} DEBUG - Running Job._fetch_from_db with retries. Try 1 of 3
[2024-04-15, 12:03:07 UTC] {retries.py:91} DEBUG - Running Job._update_heartbeat with retries. Try 1 of 3
[2024-04-15, 12:03:07 UTC] {job.py:214} DEBUG - [heartbeat]
[2024-04-15, 12:03:12 UTC] {retries.py:91} DEBUG - Running Job._fetch_from_db with retries. Try 1 of 3
[2024-04-15, 12:03:12 UTC] {retries.py:91} DEBUG - Running Job._update_heartbeat with retries. Try 1 of 3
[2024-04-15, 12:03:12 UTC] {job.py:214} DEBUG - [heartbeat]
[2024-04-15, 12:03:13 UTC] {taskinstance.py:441} ▼ Post task execution logs
[2024-04-15, 12:03:13 UTC] {taskinstance.py:2890} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 1173, in _create_direct_connection
    hosts = await asyncio.shield(host_resolved)
  File "/home/airflow/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 884, in _resolve_host
    addrs = await self._resolver.resolve(host, port, family=self._family)
  File "/home/airflow/.local/lib/python3.10/site-packages/aiohttp/resolver.py", line 33, in resolve
    infos = await self._loop.getaddrinfo(
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 863, in getaddrinfo
    return await self.run_in_executor(
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 294, in send
    result = await self.session.request( # type: ignore
  File "/home/airflow/.local/lib/python3.10/site-packages/aiohttp/client.py", line 578, in _request
    conn = await self._connector.connect(
  File "/home/airflow/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 544, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/home/airflow/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 911, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "/home/airflow/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 1187, in _create_direct_connection
    raise ClientConnectorError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host <STORAGE_ACCOUNT_NAME> .blob.core.windows.net:443 ssl:default [Name or service not known]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 465, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 432, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 400, in wrapper
    return func(self, *args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/decorators/base.py", line 265, in execute
    return_value = super().execute(context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 400, in wrapper
    return func(self, *args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 235, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 252, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/repo/src/workflows/test.py", line 24, in test_features
    print(f"Got access to datalake. ls: {fs.ls(datalake_folder)}")
  File "/home/airflow/.local/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/home/airflow/.local/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/home/airflow/.local/lib/python3.10/site-packages/adlfs/spec.py", line 823, in _ls
    output = await self._ls_blobs(
  File "/home/airflow/.local/lib/python3.10/site-packages/adlfs/spec.py", line 724, in _ls_blobs
    async for next_blob in blobs:
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/async_paging.py", line 142, in __anext__
    return await self.__anext__()
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/async_paging.py", line 145, in __anext__
    self._page = await self._page_iterator.__anext__()
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/async_paging.py", line 94, in __anext__
    self._response = await self._get_next(self.continuation_token)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/storage/blob/aio/_list_blobs_helper.py", line 83, in _get_next_cb
    return await self._command(
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/tracing/decorator_async.py", line 77, in wrapper_use_tracer
    return await func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/storage/blob/_generated/aio/operations/_container_operations.py", line 1886, in list_blob_hierarchy_segment
    pipeline_response: PipelineResponse = await self._client._pipeline.run( # pylint: disable=protected-access
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 221, in run
    return await first_node.send(pipeline_request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  [Previous line repeated 3 more times]
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/policies/_authentication_async.py", line 100, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/policies/_redirect_async.py", line 73, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/storage/blob/_shared/policies_async.py", line 137, in send
    raise err
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/storage/blob/_shared/policies_async.py", line 111, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/storage/blob/_shared/policies_async.py", line 64, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/_base_async.py", line 106, in send
    await self._sender.send(request.http_request, **request.context.options),
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/storage/blob/_shared/base_client_async.py", line 175, in send
    return await self._transport.send(request, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 332, in send
    raise ServiceRequestError(err, error=err) from err
azure.core.exceptions.ServiceRequestError: Cannot connect to host <STORAGE_ACCOUNT_NAME> .blob.core.windows.net:443 ssl:default [Name or service not known]
```
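The space in the host name is easy to miss in the final log line. As a rough diagnostic sketch, the snippet below pulls the host out of an error line and flags any whitespace in it. The regular expression and the helper name `host_from_error` are assumptions based only on the message format shown in the log above, not part of Airflow or aiohttp.

```python
import re
from typing import Optional

# Error line copied from the log above (account name left as the placeholder).
ERROR_LINE = (
    "aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host "
    "<STORAGE_ACCOUNT_NAME> .blob.core.windows.net:443 ssl:default "
    "[Name or service not known]"
)


def host_from_error(line: str) -> Optional[str]:
    """Hypothetical helper: extract the host from a 'Cannot connect to host' line."""
    match = re.search(r"Cannot connect to host (.+?):(\d+) ssl:", line)
    return match.group(1) if match else None


host = host_from_error(ERROR_LINE)
# A valid DNS host name never contains whitespace, so this is a quick sanity check.
has_whitespace = host is not None and any(ch.isspace() for ch in host)
print(repr(host), has_whitespace)
```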
The `wasb-default` connection defined in Key Vault, which produces the stray space in the URL:
```
>airflow connections get wasb-default -o yaml
- conn_id: wasb-default
  conn_type: wasb
  description: null
  extra_dejson: {}
  get_uri: wasb://https://<STORAGE_ACCOUNT_NAME>.blob.core.windows.net
  host: https://<STORAGE_ACCOUNT_NAME>.blob.core.windows.net
  id: null
  is_encrypted: null
  is_extra_encrypted: null
  login: null
  password: null
  port: null
  schema: ''
```
WASB connection defined via UI that works:
```
>airflow connections get abc -o yaml
- conn_id: abc
  conn_type: wasb
  description: ''
  extra_dejson: {}
  get_uri: wasb://https://<STORAGE_ACCOUNT_NAME>.blob.core.windows.net
  host: https://<STORAGE_ACCOUNT_NAME>.blob.core.windows.net
  id: '1'
  is_encrypted: 'False'
  is_extra_encrypted: 'False'
  login: ''
  password: null
  port: null
  schema: ''
```
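To see exactly where the two dumps differ, a small comparison can help. The sketch below copies the field values from the two `airflow connections get` outputs above into plain dictionaries and lists the keys whose values differ. The helper name `differing_fields` is hypothetical.

```python
# Field values copied from the Key Vault-backed connection dump above.
key_vault_conn = {
    "conn_id": "wasb-default",
    "conn_type": "wasb",
    "description": None,
    "extra_dejson": {},
    "get_uri": "wasb://https://<STORAGE_ACCOUNT_NAME>.blob.core.windows.net",
    "host": "https://<STORAGE_ACCOUNT_NAME>.blob.core.windows.net",
    "id": None,
    "is_encrypted": None,
    "is_extra_encrypted": None,
    "login": None,
    "password": None,
    "port": None,
    "schema": "",
}

# The UI-created connection: same fields, with the values that the second dump shows.
ui_conn = dict(
    key_vault_conn,
    conn_id="abc",
    description="",
    id="1",
    is_encrypted="False",
    is_extra_encrypted="False",
    login="",
)


def differing_fields(left: dict, right: dict) -> list:
    """Hypothetical helper: return the sorted keys whose values differ."""
    return sorted(k for k in left.keys() | right.keys() if left.get(k) != right.get(k))


diff = differing_fields(key_vault_conn, ui_conn)
print(diff)
```

In these dumps, `host` and `get_uri` are identical for both connections; only the identifier and metadata fields differ.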
### What you think should happen instead
WASB connections defined via Key Vault should not produce an extra ' ' (space) character in the URL, just as connections created via the UI don't.
### How to reproduce
1. Set up Azure Kubernetes to use Workload Identity. Attach a service account to the pods. Federate an identity to the service account. Give that federated identity access to the Azure Storage Account.
2. Configure Airflow to use Azure Key-Vault as secret backend.
3. Configure Airflow to use azure_remote_logging.
4. Create an Airflow WASB connection secret in Key-Vault. Use example from above.
5. Run a DAG.
6. The task fails because it cannot write logs to the storage container.
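Steps 2 and 3 can be sketched as environment variables, here assembled in Python so the JSON value is serialized correctly. This is a minimal sketch under assumptions: the vault URL is a placeholder, `airflow-connections` is an assumed prefix (with it and the default `-` separator, the backend should look up a secret named `airflow-connections-wasb-default`), and the remaining remote logging settings, such as the base log folder and container, are omitted.

```python
import json

# Placeholder value, not taken from the report; substitute your own vault.
VAULT_URL = "https://<KEY_VAULT_NAME>.vault.azure.net/"

# Keyword arguments passed to the Azure Key Vault secrets backend.
backend_kwargs = {
    "connections_prefix": "airflow-connections",  # assumed prefix
    "vault_url": VAULT_URL,
}

env = {
    # Step 2: use Azure Key Vault as the secrets backend.
    "AIRFLOW__SECRETS__BACKEND": (
        "airflow.providers.microsoft.azure.secrets.key_vault.AzureKeyVaultBackend"
    ),
    "AIRFLOW__SECRETS__BACKEND_KWARGS": json.dumps(backend_kwargs),
    # Step 3: enable remote logging through the WASB connection from the report.
    "AIRFLOW__LOGGING__REMOTE_LOGGING": "True",
    "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID": "wasb-default",
}

for name, value in env.items():
    print(f"{name}={value}")
```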
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] Airflow produces an unnecessary ' ' (space) in the middle of the WASB URL when WASB connection is read from Azure Key Vault secret backed. [airflow]
Posted by "boring-cyborg[bot] (via GitHub)" <gi...@apache.org>.
boring-cyborg[bot] commented on issue #39028:
URL: https://github.com/apache/airflow/issues/39028#issuecomment-2056784221
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
Re: [I] Airflow produces an unnecessary ' ' (space) in the middle of the WASB URL when WASB connection is read from Azure Key Vault secret backed. [airflow]
Posted by "gabor-one (via GitHub)" <gi...@apache.org>.
gabor-one closed issue #39028: Airflow produces an unnecessary ' ' (space) in the middle of the WASB URL when WASB connection is read from Azure Key Vault secret backed.
URL: https://github.com/apache/airflow/issues/39028
Re: [I] Airflow produces an unnecessary ' ' (space) in the middle of the WASB URL when WASB connection is read from Azure Key Vault secret backed. [airflow]
Posted by "gabor-one (via GitHub)" <gi...@apache.org>.
gabor-one commented on issue #39028:
URL: https://github.com/apache/airflow/issues/39028#issuecomment-2057211976
I made a mistake reading the log. You don't need https:// in the connection URL. Please ignore this.
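Following that note, the connection URI can be written without the embedded `https://`. The sketch below uses a hypothetical helper, `strip_embedded_scheme`, that removes an `http://` or `https://` prefix appearing directly after the `wasb://` connection type. It relies only on string handling and makes no claim about how Airflow itself parses the URI.

```python
def strip_embedded_scheme(conn_uri: str, conn_type: str = "wasb") -> str:
    """Hypothetical helper: drop an http(s):// prefix embedded after '<conn_type>://'."""
    prefix = f"{conn_type}://"
    if not conn_uri.startswith(prefix):
        # Not a URI of the expected connection type; leave it untouched.
        return conn_uri
    rest = conn_uri[len(prefix):]
    for scheme in ("https://", "http://"):
        if rest.startswith(scheme):
            rest = rest[len(scheme):]
            break
    return prefix + rest


original = "wasb://https://<STORAGE_ACCOUNT_NAME>.blob.core.windows.net"
cleaned = strip_embedded_scheme(original)
print(cleaned)
```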