You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/27 22:20:27 UTC
[GitHub] [airflow] jonasrla opened a new issue #13934: EMR URI Parse Fails to resolve nested dict objects
jonasrla opened a new issue #13934:
URL: https://github.com/apache/airflow/issues/13934
**Apache Airflow version**:
1.10.12
**Environment**:
- **Cloud provider or hardware configuration**:
- MacBook Pro Mic 2014
- **OS** (e.g. from /etc/os-release):
- macOS Big Sur 11.1
- **Kernel** (e.g. `uname -a`):
- 20.2.0 Darwin Kernel Version 20.2.0: Wed Dec 2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64 x86_64
**What happened**:
I was trying to setup an EMR connection using the Environment Variable feature, but found out there are limitations. While parsing a config that looks like this
```json
{
"Instances": {
"MasterInstanceType": "m5.xlarge",
"SlaveInstanceType": "m5.xlarge",
"InstanceCount": 2,
"Ec2SubnetId": "subnet-XXXXXXXXXXXXXXXXX"
},
"ServiceRole": "EMR_DefaultRole",
"JobFlowRole": "EMR_EC2_DefaultRole",
"ReleaseLabel": "emr-5.32.0",
"Applications": [{
"Name": "Spark"
}],
"Configurations": [{
"Classification": "spark-hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
}]
}
```
I got this
![Screen Shot 2021-01-27 at 18 41 26](https://user-images.githubusercontent.com/4976959/106059423-773f4a80-60d1-11eb-918c-4ce0310a195a.png)
**What you expected to happen**:
I expected the json were completely parsed
**How to reproduce it**:
Export some `EMR` connection in your environment such as
```
AIRFLOW_CONN_EMR_ALT='emr://?Instances=%7B%27MasterInstanceType%27%3A+%27m5.xlarge%27%2C+%27SlaveInstanceType%27%3A+%27m5.xlarge%27%2C+%27InstanceCount%27%3A+2%2C+%27Ec2SubnetId%27%3A+%27subnet-XXXXXXXXXXXXXXXXX%27%7D&ServiceRole=EMR_DefaultRole&JobFlowRole=EMR_EC2_DefaultRole&ReleaseLabel=emr-5.32.0&Applications=%5B%7B%27Name%27%3A+%27Spark%27%7D%5D&Configurations=%5B%7B%27Classification%27%3A+%27spark-hive-site%27%2C+%27Properties%27%3A+%7B%27hive.metastore.client.factory.class%27%3A+%27com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory%27%7D%7D%5D'
```
and try to connect with `emr_alt` connection
**Anything else we need to know**:
I already located the line:
https://github.com/apache/airflow/blob/1.10.12/airflow/models/connection.py#L148
https://github.com/apache/airflow/blob/2.0.0/airflow/models/connection.py#L164
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] ashb commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
ashb commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-808389284
I think generally the extra should be encoded as query-string-escaped json, so:
```
%7B%22a%22%3A%7B%22b%22%3A%221%22%2C%22c%22%3A%7B%22d%22%3A%222%22%7D%7D%2C%22e%22%3A%223%22%7D
```
It's _unwieldly as anything_
```python console
In [18]: conn = airflow.models.connection.Connection(uri='emr:///?foo=%7B%22a%22%3A%7B%22b%22%3A%221%22%2C%22c%22%3A%7B%22d%22%3A%222%22%7D%7D%2C%22e%22%3A%223%22%7D');
In [19]: conn.extra_dejson
Out[19]: {'foo': '{"a":{"b":"1","c":{"d":"2"}},"e":"3"}'}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] boring-cyborg[bot] commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-768616746
Thanks for opening your first issue here! Be sure to follow the issue template!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla edited a comment on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla edited a comment on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769876102
Just had a better idea, what about this
```python
self.extra = self.parse_extra(uri_parts.query)
def parse_extra(query):
return json.loads("{" +
", ".join([f""""{pair[0]}": {pair[1].replace("'", '"')}"""
for pair in parse_qsl(query,
keep_blank_values=True)]) +
"}")
```
basically create a parsable json out of the `parse_qsl` output and let `json.loads` do what it does best
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] ashb commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
ashb commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-808394042
How would you handle these two cases:
```json
{ "a": [ "one"] }
```
and
```json
{ "a": { "0": "one' } }
```
It's an edge case for sure, but we need to at least document how it behaves.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] ashb edited a comment on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
ashb edited a comment on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-808389284
I think generally the extra should be encoded as query-string-escaped json, so:
```
%7B%22a%22%3A%7B%22b%22%3A%221%22%2C%22c%22%3A%7B%22d%22%3A%222%22%7D%7D%2C%22e%22%3A%223%22%7D
```
It's _unwieldly as anything_
```python console
In [18]: conn = airflow.models.connection.Connection(uri='emr:///?foo=%7B%22a%22%3A%7B%22b%22%3A%221%22%2C%22c%22%3A%7B%22d%22%3A%222%22%7D%7D%2C%22e%22%3A%223%22%7D');
In [19]: conn.extra_dejson
Out[19]: {'foo': '{"a":{"b":"1","c":{"d":"2"}},"e":"3"}'}
```
Oh hmm, that's still a string.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] vikramkoka commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
vikramkoka commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-806302038
Sorry @jonasrla missed this earlier.
I don't know the answer to this, but I am sure @ashb does.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] mik-laj commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769601371
@jonasrla Now I understand the problem and I knew it existed, but I have no idea for a solution.
Related: https://github.com/apache/airflow/issues/9060
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-778779856
Hello, I'm back looking into this. The tests just failed with my modification.
I'm shouting out so our work don't clash.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla edited a comment on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla edited a comment on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769856714
I already thought a little about it. What do you think of doing this:
```python
self.extra = self.parse_extra(uri_parts.query)
def parse_extra(query):
extra = dict()
for key, value in parse_qsl(query, keep_blank_values=True):
try:
extra[key] = json.loads(value.replace("'", '"'))
except json.JSONDecodeError:
extra[key] = value
return json.dumps(extra)
```
That is, instead of using `dict()` as parser, use `json.loads()`. The only issue is that I'm using `try catch` to decide whether the value is a valid dictionary or just a string. I, myself, am not a huge fan of using `try catch` where it should be `if else`.
What do you think?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] dstandish commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
dstandish commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-814578432
#15100 provides a solution here if anyone wants to review
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla edited a comment on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla edited a comment on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-778790585
Hey guys,
I'm considering now which way should we encode our nested objects.
So looked up online and this issue and found [this discussion in angular project](https://github.com/angular/angular/issues/24868#issuecomment-409667751)
So I suggest noting our nested object like
```
{
'extra1': {
'nested_extra1': 'a value'
},
'extra2': 'other value'
}
```
as
`extra1%5Bnested_extra1%5D=a%20value&extra2=other%20value`
Extreme cases like
```
{
"a":
{"b": "1",
"c": {"d": "2"}
},
"e": "3"
}
```
would look like `a%5Bb%5D=1&a%5Bc%5D%5Bd%5D=2&e=3`
And that does happen with AWS EMR connections. Checkout this [request form](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow)
So any thoughts? Do you believe this way works?
@vikramkoka, @mik-laj
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla edited a comment on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla edited a comment on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769876102
Just had a better idea, what about this
```python
self.extra = self.parse_extra(uri_parts.query)
def parse_extra(query):
return json.loads(f"""
{{
{", ".join(['"' + pair[0] + '": ' + pair[1].replace("'", '"')
for pair in parse_qsl(query,
keep_blank_values=True)])}
}}""")
```
basically create a parsable json out of the `parse_qsl` output and let `json.loads` do what it does best
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] mik-laj commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769900193
Will this work with connections for GCP? It use the SA key (JSON file) as a parameter.
Example service account key file:
```
{
"type": "service_account",
"project_id": "project-id",
"private_key_id": "key-id",
"private_key": "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
"client_email": "service-account-email",
"client_id": "client-id",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://accounts.google.com/o/oauth2/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/service-account-email"
}
```
More information:
http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow-providers-google/latest/connections/gcp.html#configuring-the-connection
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-778790585
Hey guys,
I'm considering now which way should we encode our nested objects.
So looked up online and this issue and found [this discussion in angular project](https://github.com/angular/angular/issues/24868#issuecomment-409667751)
So I suggest noting our nested object like
```
{
'extra1': {
'nested_extra1': 'a value'
},
'extra2': 'other value'
}
```
as
`extra1[nested_extra1]=a%20value&extra2=other%20value`
Extreme cases like
```
{
"a":
{"b": "1",
"c": {"d": "2"}
},
"e": "3"
}
```
would look like `a[b]=1&a[c][d]=2&e=3`
And that does happen with AWS EMR connections. Checkout this [request form](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow)
So any thoughts? Do you believe this way works?
@vikramkoka, @mik-laj
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769020404
Thank you a for the reply, @mik-laj.
I'll change the subject a little to help you understand the real issue here. That's not an EMR Hook bug, that's a connection bug. To reproduce it simply run [this gist](https://gist.github.com/jonasrla/1ee339822866d0fb5e68f0a8de8860b5) on Airflow 2.0 or 1.10
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] dstandish closed issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
dstandish closed issue #13934:
URL: https://github.com/apache/airflow/issues/13934
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] mik-laj commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769604271
In the case of Presto, I just limited the use of nested keys and used a simple dictionary.
https://github.com/apache/airflow/blob/4d4aa9ec2941b13005204b88b8335c2599d075ac/airflow/providers/presto/hooks/presto.py#L74-L84
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] mik-laj closed issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
mik-laj closed issue #13934:
URL: https://github.com/apache/airflow/issues/13934
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] mik-laj commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-768767780
`airflow/contrib/hooks/emr_hooks.py` are deprecated and we don't support it anymore. Please migrate to backport providers. See: https://airflow.apache.org/docs/apache-airflow/stable/backport-providers.html
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla edited a comment on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla edited a comment on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769876102
Just had a better idea, what about this
```python
self.extra = self.parse_extra(uri_parts.query)
def parse_extra(query):
return json.loads("{" +
", ".join([f""""{pair[0]}": {pair[1].replace("'", '"')}"""
for pair in parse_qsl(query,
keep_blank_values=True)]) +
"}")
```
basically create a parsable json out of the `parse_qsl` output and let `json.loads` do what it does best
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769876102
Just had a better idea, what about this
```python
self.extra = self.parse_extra(uri_parts.query)
def parse_extra(query):
return json.loads("{" +
", ".join([f""""{pair[0]}": {pair[1].replace("'", '"')}"""
for pair in parse_qsl(query,
keep_blank_values=True)]) +
"}")
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] dstandish commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
dstandish commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-820080029
closing since this is resolved by https://github.com/apache/airflow/pull/15100
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769856714
I already thought a little about it. What do you think of doing this:
```python
self.extra = self.parse_extra(uri_parts.query)
def parse_extra(query):
extra = dict()
for key, value in parse_qsl(query, keep_blank_values=True):
try:
extra[key] = json.loads(value.replace("'", '"'))
except json.JSONDecodeError:
extra[key] = value
return json.dumps(extra)
```
That is, instead of using dict as parser, use json.loads. The only issue is that I'm using `try catch` to decide whether the value is a valid dictionary or just a string. I, myself, am not a huge fan of using `try catch` where it should be `if else`.
What do you think?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] jonasrla commented on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
jonasrla commented on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-769888083
actually we don't need `json.loads` if it stores as `json.dumps`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] ashb edited a comment on issue #13934: EMR URI Parse Fails to resolve nested dict objects
Posted by GitBox <gi...@apache.org>.
ashb edited a comment on issue #13934:
URL: https://github.com/apache/airflow/issues/13934#issuecomment-808389284
I think generally the extra should be encoded as query-string-escaped json, so:
```
%7B%22a%22%3A%7B%22b%22%3A%221%22%2C%22c%22%3A%7B%22d%22%3A%222%22%7D%7D%2C%22e%22%3A%223%22%7D
```
It's _unwieldly as anything_
```python console
In [18]: conn = airflow.models.connection.Connection(uri='emr:///?foo=%7B%22a%22%3A%7B%22b%22%3A%221%22%2C%22c%22%3A%7B%22d%22%3A%222%22%7D%7D%2C%22e%22%3A%223%22%7D');
In [19]: conn.extra_dejson
Out[19]: {'foo': '{"a":{"b":"1","c":{"d":"2"}},"e":"3"}'}
```
Oh hmm, that's still a string.
So there isn't away way of providing a nested structure via URI currently, so we'll have to come up with something.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org