You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/31 18:16:33 UTC

[GitHub] [airflow] dwreeves opened a new pull request, #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

dwreeves opened a new pull request, #25432:
URL: https://github.com/apache/airflow/pull/25432

   This PR addresses #25104
   
   ---
   
   # Changes
   
   ## 1. JSON secrets do not need to be URL-encoded when `full_url_mode=False`.
   
   ^ The whole reason for this PR. This is the main behavior that is being implemented.
   
   Specifically, this behavior is implemented for when `get_connection()` is called. This method returns a `Optional[Connection]` object. The `Connection()` can be built either using a `uri=?`, or with kwargs corresponding with the parts of the URI, e.g. `host=?`, `port=?`, etc. In the former case, URL-encoding is required; in the latter case, it is not.
   
   Users will receive a DeprecationWarning if the code detects that their secrets are URL-encoded (more on that in section 2 below). In most cases, the user should be able to simply decode their secret and everything will continue working normally.
   
   I tried to make this change, as well as the other changes, as quietly as possible for typical Airflow runtimes. Basically, if the user isn't doing anything weird, the only thing they will be required to do is change their secrets from being URL-encoded to decoded.
   
   To support this behavior, a few additional things were either needed to be changed, or made a lot of sense to change. The rest of this message will describe what those additional changes are.
   
   ## 2. Added `secret_values_are_urlencoded=?` kwarg for `SecretsManagerBackend.__init__`, albeit most users do not need to touch this.
   
   @potiuk suggested adding a kwarg for assisting in migration. I wanted to avoid this if necessary because it is very obtrusive and causes a negative user experience.
   
   Is it possible to avoid needing to reconfigure the secrets manager backend? Yes, in most cases!
   
   What I dicovered is that **if decoding the URL-encoded values is idempotent, the user needs to decode their secrets, but the `backend_kwargs` config option does not need to be adjusted.** The reason why idempotency obviates a need to touch the config is that idempotency means the intended value of the secret is unambiguous once the values are decoded. This is a pretty big win from a user experience perspective because adjusting the Airflow configuration can be a nuisance in practice (e.g. a developer might need to get a system administrator involved).
   
   I add a longer explanation in the comments for the code:
   
   ```python
   # When ``unquote(v) == v``, then removing unquote won't affect the user, regardless of
   # whether or not ``v`` is URL-encoded. For example, "foo bar" is not URL-encoded. But
   # because decoding it doesn't affect the value, then it will migrate safely when
   # ``unquote`` gets removed.
   #
   # When parameters are URL-encoded, but decoding is idempotent, we need to warn the user
   # to un-escape their secrets. For example, if "foo%20bar" is a URL-encoded string, then
   # decoding is idempotent because ``unquote(unquote("foo%20bar")) == unquote("foo%20bar")``.
   #
   # In the rare situation that value is URL-encoded but the decoding is _not_ idempotent,
   # this causes a major issue. For example, if "foo%2520bar" is URL-encoded, then decoding is
   # _not_ idempotent because ``unquote(unquote("foo%2520bar")) != unquote("foo%2520bar")``
   #
   # This causes a problem for migration because if the user decodes their value, we cannot
   # infer that is the case by looking at the decoded value (from our vantage point, it will
   # look to be URL-encoded.)
   #
   # So when this uncommon situation occurs, the user _must_ adjust the configuration and set
   # ``parameters_are_urlencoded`` to False to migrate safely. In all other cases, we do not
   # need the user to adjust this object to migrate; they can transition their secrets with
   # the default configuration.
   ```
   
   The kwarg is needed in the very rare case that the kwarg is not idempotent, i.e. the string literal for the decoded secret contains a `%`. In this case, the user will need to manually set `secret_values_are_urlencoded` to `False`.
   
   ## 3. Send DeprecationWarning in a niche situation for `get_conn_value()`.
   
   `get_conn_value()` is allowed to return a JSON as of 2.3.0.
   
   However, in the unlikely case that someone is both (1) using `get_conn_value()` directly, and (2) is using `full_url_mode=False`, we want to warn them that they will no longer receive a URL in the future.
   
   2. The base implementation of `get_connection()` will generate a `Connection` object when `get_conn_value()` returns a JSON-- or more conceptually, when the secret is a valid JSON.
   
   2. When `get_conn_value()` returns a JSON, `get_connection()` is able to create a `Connection` object from the JSON.
      a. Crucially, this does _not_ require URL-encoding for the base Airflow APIs.
   
   This is really challenging to do elegantly if the method `SecretsManagerBackend.get_conn_value()` needs to retain 100% backwards compatibility. By that I mean: if it is never allowed to return a JSON string representation of the secret.
   
   For example, if `_get_secret()` returns a string `'{"host": "foo", "conn_type": "postgres", "schema": "mydb"}'`, then the current behavior is that `SecretsManagerBackend.get_conn_value()` will return a string `'postgres://:@foo/mydb'`.
   
   Under the base class's implementation, `BaseSecretsBackend.get_conn_value()` _is_ allowed to return a JSON string. But `SecretsManagerBackend` _never_ does that. If the behavior of the overridden is relaxed to allow for JSON strings as per the base implementation, then this becomes a little easier to do without writing complete spaghetti.
   
   For "typical" Airflow API usage there is no harm because `get_conn_value()` is not typically called directly. However, this is a pretty big package, and lots of people do lots of things you might not expect, so I opted to go with smoothly transitioning away from returning a URI.
   
   ## 4. Deprecate `ast.literal_eval` (i.e. support for dict reprs) for JSON decoding.
   
   I want to be diplomatic, but I also want to get to the point, so please do not interpret this negatively. Here is the original code:
   
   ```python
   try:
       secret_string = self._get_secret(self.connections_prefix, conn_id)
       # json.loads gives error
       secret = ast.literal_eval(secret_string) if secret_string else None
   except ValueError:  # 'malformed node or string: ' error, for empty conns
       connection = None
       secret = None
   ```
   
   ^ This is a somewhat common "mistake" in Python. The "mistake" being that someone `print()`s a dict, and then they CTRL+C the `repr` for the dict, and they think it's a JSON. So when they then CTRL+V the string somewhere that expects a JSON, they hit a JSON decode error. And then instead of thinking, "I made a mistake," they believe that they've uncovered a bug in their first couple hundred hours of programming in a core Python module that has existed for decades. (Notably, the comment in the Python code says "json.loads gives error", which confirms this story.)
   
   There are two reasons to remove `ast.literal_eval` and just use `json.loads`. The first reason is a bit more philosophical, which is that the API should not support an odd implementation and should not hand-hold for mistakes at this level of simplicity.
   
   The second reason is to provide a little more consistency across the Airflow API:
   
   - `{'conn_type': 'postgres', 'host': 'postgres'}` is not a valid secret elsewhere in Airflow, but `{"conn_type": "postgres", "host": "postgres"}` is.
   - `get_conn_value()` is allowed to return a JSON string, but not a dict repr. When we migrate toward `get_conn_value()` returning a JSON string, we should discourage use of dict reprs.
   
   The original PR that implemented the `ast.literal_eval()` approach (#15104) was primarily focused on adding more support for various aliases for keys. Overall, this a fine addition to the code. For various reasons, maintainers should be committed to retaining this feature. However, the use of `ast.literal_eval` was never part of the discussion for this specification; there was not a PR specifically devoted to migrating `json.loads` to `ast.literal_eval`.
   
   Removing it also should not be disruptive to the vast majority of people, since:
   
   - The Secrets Manager UI defaults to parsing manually input key-value pairs as a JSON.
   - If you input a custom string that is a dict repr but not a JSON, the UI will show an error. It would be unusual for a user to submit this as a secret, since the user would be ignoring the UI's warnings. (I just tested and confirmed manually that this is the case by creating a secret that is `{'foo': 'bar'}`.)
   
   ## 5. Support `extra` being either a JSON string or a dict
   
   AWS Secrets Manager allows for storage of arbitrary strings. For example, both of these are valid secrets to store within the Secrets Manager:
   
   ```json
   {"extra": "{\\"foo\\": \\"bar\\"}"}
   ```
   
   and
   
   ```json
   {"extra": {"foo": "bar"}}
   ```
   
   Right now, the former is supported but not the latter. There doesn't seem to be a good reason why the latter should not be supported, other than that the AWS UI will fail to parse key-value pairs. But it's still a valid secret!
   
   ## 6. Added some type annotations.
   
   Some function signatures were missing type annotations, so I added them in. All of the type annotations for overridden methods adhere to the signatures from the base implementation of the class.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dwreeves commented on pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
dwreeves commented on PR #25432:
URL: https://github.com/apache/airflow/pull/25432#issuecomment-1205370681

   Sorry about that, I'll get to this later today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25432:
URL: https://github.com/apache/airflow/pull/25432#issuecomment-1207230493

   Let me know @vincbeck if you have more comments, otherwise I merge it before releasing providers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dwreeves commented on pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
dwreeves commented on PR #25432:
URL: https://github.com/apache/airflow/pull/25432#issuecomment-1205358393

   @potiuk The tests that are failing are caused by the upstream, not any changes in this commit. (The sqlite tests in particular are what are failing.) Looks like the CI is broken.
   
   The tests were passing prior to merging with the upstream.
   
   Should I rollback the merge commit that causes the tests to fail.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25432:
URL: https://github.com/apache/airflow/pull/25432#issuecomment-1205354714

   Also doc build fail


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25432:
URL: https://github.com/apache/airflow/pull/25432#issuecomment-1205354246

   Can you please make tests pass before anyone deeply dives?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dwreeves commented on pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
dwreeves commented on PR #25432:
URL: https://github.com/apache/airflow/pull/25432#issuecomment-1206020730

   @potiuk Tests passing. Sorry about that; I had a typo in one of the tests I added, and I'm still not entirely sure why the Sphinx issue happened, but it's fixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk merged pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
potiuk merged PR #25432:
URL: https://github.com/apache/airflow/pull/25432


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] vincbeck commented on pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
vincbeck commented on PR #25432:
URL: https://github.com/apache/airflow/pull/25432#issuecomment-1208181437

   I know the PR is already merged but LGTM :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] vincbeck commented on a diff in pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
vincbeck commented on code in PR #25432:
URL: https://github.com/apache/airflow/pull/25432#discussion_r939093095


##########
airflow/providers/amazon/aws/secrets/secrets_manager.py:
##########
@@ -94,6 +95,9 @@ class SecretsManagerBackend(BaseSecretsBackend, LoggingMixin):
     :param full_url_mode: if True, the secrets must be stored as one conn URI in just one field per secret.
         If False (set it as false in backend_kwargs), you can store the secret using different
         fields (password, user...).
+    :param secret_values_are_urlencoded: If True, and full_url_mode is False, then the values are assumed to

Review Comment:
   For such boolean value, the verb is/are is always at the beginning
   
   ```suggestion
       :param are_secret_values_urlencoded: If True, and full_url_mode is False, then the values are assumed to
   ```



##########
airflow/providers/amazon/aws/secrets/secrets_manager.py:
##########
@@ -150,31 +175,154 @@ def _format_uri_with_extra(secret, conn_string):
 
         return conn_string
 
-    def get_uri_from_secret(self, secret):
+    def get_connection(self, conn_id: str) -> Optional[Connection]:
+        if not self.full_url_mode:
+            secret_string = self._get_secret(self.connections_prefix, conn_id)
+            secret_dict = self._deserialize_json_string(secret_string)
+
+            if not secret_dict:
+                return None
+
+            if 'extra' in secret_dict and isinstance(secret_dict['extra'], str):
+                secret_dict['extra'] = self._deserialize_json_string(secret_dict['extra'])
+
+            data = self._standardize_secret_keys(secret_dict)
+
+            if self.secret_values_are_urlencoded:
+                data = self._remove_escaping_in_secret_dict(secret=data, conn_id=conn_id)
+
+            port: Optional[int] = None
+
+            if data['port'] is not None:
+                port = int(data['port'])
+
+            return Connection(
+                login=data['user'],
+                password=data['password'],
+                host=data['host'],
+                port=port,
+                schema=data['schema'],
+                conn_type=data['conn_type'],
+                extra=data['extra'],
+            )
+
+        return super().get_connection(conn_id=conn_id)
+
+    def _standardize_secret_keys(self, secret: Dict[str, Any]) -> Dict[str, Any]:
+        """Standardize the names of the keys in the dict. These keys align with"""
         possible_words_for_conn_fields = {
             'user': ['user', 'username', 'login', 'user_name'],
             'password': ['password', 'pass', 'key'],
             'host': ['host', 'remote_host', 'server'],
             'port': ['port'],
             'schema': ['database', 'schema'],
             'conn_type': ['conn_type', 'conn_id', 'connection_type', 'engine'],
+            'extra': ['extra'],
         }
 
         for conn_field, extra_words in self.extra_conn_words.items():
             possible_words_for_conn_fields[conn_field].extend(extra_words)
 
-        conn_d = {}
+        conn_d: Dict[str, Any] = {}
         for conn_field, possible_words in possible_words_for_conn_fields.items():
             try:
                 conn_d[conn_field] = [v for k, v in secret.items() if k in possible_words][0]
             except IndexError:
-                conn_d[conn_field] = ''
+                conn_d[conn_field] = None
 
-        conn_string = "{conn_type}://{user}:{password}@{host}:{port}/{schema}".format(**conn_d)
+        return conn_d
 
+    def get_uri_from_secret(self, secret: Dict[str, str]) -> str:
+        conn_d: Dict[str, str] = {k: v if v else '' for k, v in self._standardize_secret_keys(secret).items()}
+        conn_string = "{conn_type}://{user}:{password}@{host}:{port}/{schema}".format(**conn_d)
         return self._format_uri_with_extra(secret, conn_string)
 
-    def get_conn_value(self, conn_id: str):
+    def _deserialize_json_string(self, value: Optional[str]) -> Optional[Dict[Any, Any]]:
+        if not value:
+            return None
+        try:
+            # Use ast.literal_eval for backwards compatibility.
+            # Previous version of this code had a comment saying that using json.loads caused errors.
+            # This likely means people were using dict reprs instead of valid JSONs.
+            res: Dict[str, Any] = json.loads(value)
+        except json.JSONDecodeError:
+            try:
+                res = ast.literal_eval(value) if value else None
+                warnings.warn(
+                    f'In future versions, `{type(self).__name__}` will only support valid JSONs, not dict'
+                    ' reprs. Please make sure your secret is a valid JSON.'
+                )
+            except ValueError:  # 'malformed node or string: ' error, for empty conns
+                return None
+
+        return res
+
+    def _remove_escaping_in_secret_dict(self, secret: Dict[str, Any], conn_id: str) -> Dict[str, Any]:
+        # When ``unquote(v) == v``, then removing unquote won't affect the user, regardless of
+        # whether or not ``v`` is URL-encoded. For example, "foo bar" is not URL-encoded. But
+        # because decoding it doesn't affect the value, then it will migrate safely when
+        # ``unquote`` gets removed.
+        #
+        # When parameters are URL-encoded, but decoding is idempotent, we need to warn the user
+        # to un-escape their secrets. For example, if "foo%20bar" is a URL-encoded string, then
+        # decoding is idempotent because ``unquote(unquote("foo%20bar")) == unquote("foo%20bar")``.
+        #
+        # In the rare situation that value is URL-encoded but the decoding is _not_ idempotent,
+        # this causes a major issue. For example, if "foo%2520bar" is URL-encoded, then decoding is
+        # _not_ idempotent because ``unquote(unquote("foo%2520bar")) != unquote("foo%2520bar")``
+        #
+        # This causes a problem for migration because if the user decodes their value, we cannot
+        # infer that is the case by looking at the decoded value (from our vantage point, it will
+        # look to be URL-encoded.)
+        #
+        # So when this uncommon situation occurs, the user _must_ adjust the configuration and set
+        # ``parameters_are_urlencoded`` to False to migrate safely. In all other cases, we do not
+        # need the user to adjust this object to migrate; they can transition their secrets with
+        # the default configuration.
+
+        warn_user = False
+        idempotent = True
+
+        for k, v in secret.copy().items():
+
+            if k == "extra" and isinstance(v, dict):
+                # The old behavior was that extras were _not_ urlencoded inside the secret.
+                # If they were urlencoded (e.g. "foo%20bar"), then they would be re-urlencoded
+                # (e.g. "foo%20bar" becomes "foo%2520bar") and then unquoted once when parsed.
+                # So we should just allow the extra dict to remain as-is.
+                continue
+
+            elif v is not None:
+                v_unquoted = unquote(v)
+                if v != v_unquoted:
+                    secret[k] = unquote(v)
+                    warn_user = True
+
+                    # Check to see if decoding is idempotent.
+                    if v_unquoted == unquote(v_unquoted):
+                        idempotent = False
+
+        if warn_user:
+            msg = (
+                "When ``full_url_mode=True``, URL-encoding secret values is deprecated. In future versions, "
+                f" this value will not be un-escaped. For the conn_id {conn_id!r}, please remove the"
+                " URL-encoding."
+                "\n\nThis warning was raised because the SecretsManagerBackend detected that this connection"
+                " was URL-encoded."
+            )
+            if idempotent:
+                msg = f" Once the values for conn_id {conn_id!r} are decoded, this warning will go away."
+            if not idempotent:
+                msg += (
+                    " In addition to decoding the values for your connection, you must also set"
+                    " ``secret_values_are_urlencoded=False`` for your config variable"
+                    " ``secrets.backend_kwargs`` because this connection's URL encoding is not idempotent."

Review Comment:
   I am not sure `because this connection's URL encoding is not idempotent` gives enough details to the user in order to understand the issue. I would add a link to the documentation you are updating in this PR



##########
docs/apache-airflow-providers-amazon/secrets-backends/aws-secrets-manager.rst:
##########
@@ -83,6 +83,27 @@ Verify that you can get the secret:
 
 If you don't want to use any ``connections_prefix`` for retrieving connections, set it as an empty string ``""`` in the configuration.
 
+URL-Encoding of Secrets When Full URL Mode is False
+"""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Previous versions of the Amazon provider package required values in the AWS secret to be URL-encoded when the setting ``full_url_mode`` is ``false``.
+This behavior is now deprecated, and will be removed at a future date.
+
+In most cases, you should not have any issues migrating your secrets to not being URL-encoded in advance of the deprecation.
+Simply decoding your secret values will work, and no further changes are required.
+
+In rare circumstances, when URL-encoding is not idempotent, the ``DeprecationWarning`` will tell you to add a new parameter to your ``backend_kwargs``.

Review Comment:
   I would add an example to explain what idempotent means in this situation. I would actually reuse the example you gave in comments
   ```
   For example, if "foo%2520bar" is URL-encoded, then decoding is _not_ idempotent because ``unquote(unquote("foo%2520bar")) != unquote("foo%2520bar")``
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25432:
URL: https://github.com/apache/airflow/pull/25432#issuecomment-1206658959

   @o-nikolas @ferruzzi @vincbeck -> I'd also love to hear your opinion on that one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dwreeves commented on pull request #25432: Avoid requirement that AWS Secret Manager JSON values be urlencoded.

Posted by GitBox <gi...@apache.org>.
dwreeves commented on PR #25432:
URL: https://github.com/apache/airflow/pull/25432#issuecomment-1207272769

   Thank you for the good feedback @vincbeck, and thank you for the approval @potiuk!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org