Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/11/19 00:08:52 UTC

[GitHub] [airflow] potiuk opened a new pull request #12466: Adds support for Connection discovery from providers

potiuk opened a new pull request #12466:
URL: https://github.com/apache/airflow/pull/12466


   This PR extends provider discovery with a mechanism for
   retrieving the mapping from connection type to hook.
   
   Fixes #12456
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r526511344



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -17,6 +17,11 @@
         "type": "string"
       }
     },
+    "provider-package": {

Review comment:
       @ashb @mik-laj @kaxil -> this is also in preparation for being able to use 3rd-party providers. I've added provider-package to the provider.yaml to avoid repetition and made all modules/classes relative. This way it is really safe - because we will not be able to accidentally point to a class outside of our own package.
   
    I also added a check that the provider package is properly named, but only when it starts with "airflow.providers".
    
    Currently we only support `airflow.providers` anyway.
    
    It might be good to discuss now and agree on future conventions for those 3rd-parties:
    
    * 1) are we allowing any package to contain a provider?
    
    If not:
    
    * 2a) should all providers live in airflow.providers?
    * 2b) should they live in airflow.contrib.providers (or similar)?
    
    if 2a) or 2b) -> should the packages be named following the convention:
    
       apache-airflow-providers-<PROVIDER> (with '.' replaced by '-')
    
    OR maybe some other approach? WDYT?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r527272497



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -17,6 +17,11 @@
         "type": "string"
       }
     },
+    "provider-package": {

Review comment:
       Thanks! 
   
    TL;DR: I think the entrypoint idea for runtime optimization is really cool; happy to incorporate it as long as we keep the "developer" experience intact (or even optimize it in this case as well).
   
   > Yes, I firmly believe this should be the case. By not having third party packages live under the airflow. namespace, it becomes much much easier for Us and hopefully users to know what is an official provider, and thus supported by us vs not.
   
    1. Do you think we should prevent 3rd parties from using airflow.* ? I do not think there is a way of enforcing it unless we sign at least some of the packages cryptographically with our own key. Or is it just a convention you propose here - that other packages SHOULD NOT use "airflow" as a package name, by convention?
   
   > That way any package can register a provider, by adding something like this to it's setup.cfg:
   
   [options.entry_points]
   apache-airflow-provider=
       x=my_company.provider.x:register_provider
   
    That's a cool idea, especially since rather than returning a hard-coded dictionary, we can return the very same `provider.yaml` file as a dictionary. That will make it super-easy for anyone writing their own provider: they will have the .json.schema that defines the expected structure of the file, and it will be very easy to explain. It is very standard, supported by most IDEs with auto-completion and documentation (I used it to add the current files), and it can be used to verify such provider information during provider development.
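    
    For illustration, a minimal sketch of what such an entrypoint target could look like (the `register_provider` name and `my_company.provider` package come from the setup.cfg example above; shipping provider.yaml inside the package is an assumption):
    
    ```python
    # my_company/provider/x.py - hypothetical module from the setup.cfg example
    from importlib.resources import read_text
    
    import yaml  # PyYAML
    
    
    def register_provider() -> dict:
        # Return the packaged provider.yaml parsed into a dict, so Airflow can
        # validate it against airflow/provider.yaml.schema.json at discovery time.
        return yaml.safe_load(read_text("my_company.provider", "provider.yaml"))
    ```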
   
    This might help to speed up the runtime discovery - one of the problems with `walk_packages` is that it is a bit slow, so not having to use it at runtime is good. And I am happy to change the runtime mechanism to that, as long (as discussed on slack) as we solve the problem of local provider development. We still need to support testing and developing airflow providers directly from checked-out airflow sources. Having to install each provider as a package in order to test it sounds like a terrible idea for the people who develop providers (which is most of the casual contributors to airflow).
   
    If for runtime we can switch to entry_points, that also makes it possible to optimize the directly-from-sources case. This is then much easier because we can rely on packages being in the right folder, and we can use simple directory walking to find all provider files, which makes it considerably faster. Paired with the runtime "entrypoint" discovery, I think we can have a very optimized solution that keeps all the nice properties we have now: provider "meta-data" stays in one place, as a yaml file validated by json.schema, for the "airflow" providers. But if someone wants, they can still keep it as dict objects or json in 3rd-party providers, as long as they follow the schema. I don't think we need to force anyone to use yaml for that, even if it is super-convenient to us and we also use it for documentation generation.
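    
    A minimal sketch of that directory walking, assuming it runs from the root of a checked-out airflow repository (the path and variable names are illustrative):
    
    ```python
    from pathlib import Path
    
    import yaml  # PyYAML
    
    # Scan the source tree for provider.yaml files directly instead of importing
    # every module with the much slower pkgutil.walk_packages.
    for provider_yaml_path in sorted(Path("airflow/providers").rglob("provider.yaml")):
        provider_info = yaml.safe_load(provider_yaml_path.read_text())
    ```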
   
   
    Unless you have another, better idea that keeps the developer experience intact.
   
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735373460


   > Some of the existing ones are also not DbApiHook either - docker, gcpssh, grpc for example.
   
    We didn't have clear rules on when to add these classes, so sometimes classes were added but then never used. Actually, in the Airflow code this is only used by operators that use DbApiHook. I did not observe any other use cases.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r526785812



##########
File path: airflow/providers_manager.py
##########
@@ -46,42 +45,134 @@ class ProvidersManager:
     """Manages all provider packages."""
 
     def __init__(self):
+        self.__log = logging.getLogger(__name__)
         self._provider_directory = {}
         try:
             from airflow import providers
         except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
+            self.__log.warning("No providers are present or error when importing them! :%s", e)
             return
-        self._validator = _create_validator()
+        self.__connections: Dict[str, Tuple[str, str]] = {}
+        self.__validator = _create_validator()
         self.__find_all_providers(providers.__path__)
+        self.__retrieve_connections()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    @staticmethod
+    def get_full_class_name(provider_package: str, class_path: str):
+        """
+        Return full class name - if the class starts with ., it adds package_name in front.
+        :param provider_package: Package name
+        :param class_path: class path (might be relative if starts with .)
+        :return: full class name
+        """
+        if class_path.startswith('.'):
+            return f"{provider_package}{class_path}"
+        return class_path
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
-            try:
-                imported_module = importlib.import_module(module_info.name)
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when importing %s:%s", module_info.name, e)
-                continue
+    def __find_all_providers(self, paths: str) -> None:
+        """
+        Finds all providers installed and stores then in the directory of the providers.
+        :param paths: Path where to look for providers
+        """
+        import warnings
+
+        old_warn = warnings.warn
+
+        def warn(*args, **kwargs):
+            if "deprecated" not in args[0]:
+                old_warn(*args, **kwargs)
+
+        warnings.warn = warn

Review comment:
       Use the built-in mechanism for this: https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings
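    
    For reference, a minimal sketch of that built-in approach; the message filter mirrors the substring check in the diff above, and the module name is illustrative:
    
    ```python
    import importlib
    import warnings
    
    # The filter change is scoped to the with-block and restored on exit,
    # so the process-wide warnings configuration is left untouched.
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=".*deprecated.*")
        imported_module = importlib.import_module("airflow.providers.sqlite")
    ```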

##########
File path: airflow/providers_manager.py
##########
@@ -46,42 +45,134 @@ class ProvidersManager:
     """Manages all provider packages."""
 
     def __init__(self):
+        self.__log = logging.getLogger(__name__)

Review comment:
       (Not important, but why so private?)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532060474



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {

Review comment:
       That sounds good to me. If we need a new connection type in the Web UI, we have to load the module and mark it by adding the hook to this field.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532191286



##########
File path: airflow/plugins_manager.py
##########
@@ -30,6 +30,7 @@
 import importlib_metadata
 
 from airflow import settings
+from airflow.utils.entry_points_with_dist import entry_points_with_dist

Review comment:
       Module name is a little bit specific -- how about `from airflow.utils.plugins import entry_points_with_dist`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531566988



##########
File path: dev/provider_packages/SETUP_TEMPLATE.py.jinja2
##########
@@ -59,19 +59,6 @@ def do_setup(version_suffix_for_pypi=''):
         packages=find_namespace_packages(
             include=['airflow.providers.{{ PROVIDER_PACKAGE_ID }}',
                      'airflow.providers.{{ PROVIDER_PACKAGE_ID }}.*']),
-        package_data={ '': [
-{% if PROVIDER_PACKAGE_ID == 'amazon' %}
-          "*.json",
-{% elif PROVIDER_PACKAGE_ID == 'cncf.kubernetes' %}
-          "*.yaml",
-{% elif PROVIDER_PACKAGE_ID == 'google' %}
-          "*.yaml",
-          "*.sql",
-{% elif PROVIDER_PACKAGE_ID == 'papermill' %}
-          "*.ipynb",
-{% endif %}
-            ],
-        },

Review comment:
       ```
   [jarek:~/code/airflow] [test-pip-check] add-connection-discovery-to-providers+ ± unzip -t dist/apache_airflow_providers_google-1.0.0b2-py3-none-any.whl | grep sql 
       testing: airflow/providers/google/cloud/example_dags/example_bigquery_query.sql   OK
       testing: airflow/providers/google/cloud/example_dags/example_cloud_sql.py   OK
       testing: airflow/providers/google/cloud/example_dags/example_cloud_sql_query.py   OK
       testing: airflow/providers/google/cloud/example_dags/example_dataflow_sql.py   OK
       testing: airflow/providers/google/cloud/example_dags/example_mysql_to_gcs.py   OK
       testing: airflow/providers/google/cloud/example_dags/example_spanner.sql   OK
       testing: airflow/providers/google/cloud/hooks/cloud_sql.py   OK
       testing: airflow/providers/google/cloud/operators/cloud_sql.py   OK
       testing: airflow/providers/google/cloud/transfers/bigquery_to_mysql.py   OK
       testing: airflow/providers/google/cloud/transfers/mssql_to_gcs.py   OK
       testing: airflow/providers/google/cloud/transfers/mysql_to_gcs.py   OK
       testing: airflow/providers/google/cloud/transfers/sql_to_gcs.py   OK
   [jarek:~/code/airflow] [test-pip-check] add-connection-discovery-to-providers+ ± tar -tvzf dist/apache-airflow-providers-google-1.0.0b2.tar.gz| grep sql          
   -rw-r--r-- root/root     10832 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/example_dags/example_cloud_sql_query.py
   -rw-r--r-- root/root      2471 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/example_dags/example_dataflow_sql.py
   -rw-r--r-- root/root       879 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/example_dags/example_spanner.sql
   -rw-r--r-- root/root      1465 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/example_dags/example_mysql_to_gcs.py
   -rw-r--r-- root/root     13397 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/example_dags/example_cloud_sql.py
   -rw-r--r-- root/root       816 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/example_dags/example_bigquery_query.sql
   -rw-r--r-- root/root     44573 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/operators/cloud_sql.py
   -rw-r--r-- root/root     44392 2020-11-27 12:15 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/hooks/cloud_sql.py
   -rw-r--r-- root/root      6769 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/transfers/bigquery_to_mysql.py
   -rw-r--r-- root/root      5107 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/transfers/mysql_to_gcs.py
   -rw-r--r-- root/root      3093 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/transfers/mssql_to_gcs.py
   -rw-r--r-- root/root     13610 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/transfers/sql_to_gcs.py
   [jarek:~/code/airflow] [test-pip-check] add-connection-discovery-to-providers+ ± 
   
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r527275948



##########
File path: airflow/providers/apache/cassandra/provider.yaml
##########
@@ -33,10 +35,15 @@ integrations:
 sensors:
   - integration-name: Apache Cassandra
     python-modules:
-      - airflow.providers.apache.cassandra.sensors.record
-      - airflow.providers.apache.cassandra.sensors.table
+      - .sensors.record
+      - .sensors.table
 
 hooks:
   - integration-name: Apache Cassandra
     python-modules:
-      - airflow.providers.apache.cassandra.hooks.cassandra
+      - .hooks.cassandra
+
+connections:
+  - connection-type: cassandra
+    hook-class: .hooks.cassandra.CassandraHook
+    connection-id-parameter-name: cassandra_conn_id

Review comment:
       There are many other things that are not yet defined here (other PRs are addressing it) - extra links and connection forms in javascript. If you look at those, you will understand that they cannot be as easily incorporated in the classes.
    
    I prefer to keep "provider delivered functionality" in one place rather than some of it in .yaml, some of it in connection classes and some of it in other classes.
    
    Also, your proposal is far from complete - it does not actually solve the problem of discovering those Hooks (unless you have not posted it). Even with the entry_points mechanism, you will have to somehow keep at least the list of hooks in each provider (there are at least 4 in the google provider). That would mean that you would have to not only keep the list of hooks but also remember to update the "extra" information in the hooks - this will have to be (by its nature) split between two places.
    
    What was your idea here? Where did you want to keep the list of connections to load?
    




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r528382711



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -17,6 +17,11 @@
         "type": "string"
       }
     },
+    "provider-package": {

Review comment:
       Hey @ashb - implemented all that in #12512, and this one is now only about connections. We can discuss it separately.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532187749



##########
File path: airflow/providers/google/cloud/hooks/bigquery.py
##########
@@ -67,7 +67,9 @@
 class BigQueryHook(GoogleBaseHook, DbApiHook):
     """Interact with BigQuery. This hook uses the Google Cloud connection."""
 
-    conn_name_attr = 'gcp_conn_id'  # type: str
+    conn_name_attr = 'gcp_conn_id'
+    default_conn_name = 'google_cloud_default'
+    conn_type = 'google_cloud_platform'

Review comment:
       @potiuk :+1: 
   
    @mik-laj If we want to be able to get a BigQuery hook from `Connection.get_hook`, we'll need a unique type, won't we?
   
   (Right now the conn type for BQ is google_cloud_platform, but that means, say, DataProc can't be google_cloud_platform)
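    
    A short illustration of the constraint (`Connection.get_hook` resolves the hook class from `conn_type` alone, so one type can map to only one hook; the conn ids here are made up):
    
    ```python
    from airflow.models.connection import Connection
    
    # Both connections resolve to the same hook class, because the lookup key
    # is conn_type - there is no way to say "this one is for BigQuery".
    bq_conn = Connection(conn_id="my_bigquery", conn_type="google_cloud_platform")
    dataproc_conn = Connection(conn_id="my_dataproc", conn_type="google_cloud_platform")
    hook = bq_conn.get_hook()
    ```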




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735364914


   Ah, you've just done the existing ones, cool. Didn't compare that - I had a list of providers open on the other screen was all.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532121242



##########
File path: airflow/providers/google/cloud/hooks/bigquery.py
##########
@@ -67,7 +67,9 @@
 class BigQueryHook(GoogleBaseHook, DbApiHook):
     """Interact with BigQuery. This hook uses the Google Cloud connection."""
 
-    conn_name_attr = 'gcp_conn_id'  # type: str
+    conn_name_attr = 'gcp_conn_id'
+    default_conn_name = 'google_cloud_default'
+    conn_type = 'google_cloud_platform'

Review comment:
       Shouldn't this be something like `conn_type = 'big_query'`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532087796



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {

Review comment:
       For me, this is good enough for now. In the future, we can divide these classes into two separate classes. We don't really need to do this yet.
   
   However, if you are interested, I can prepare a small snippet of code that does this.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532053560



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {

Review comment:
       I wanted to explicitly call out the Hook classes, not modules. I think this will be faster and better than trying to find Hooks automatically, and there might be some exceptions that we would have to handle (for example Hooks derived from other Hooks). However, I wanted to run a check and see how many of those exceptions we have; if we find that there are not many, we could indeed try to find all the classes deriving from BaseHook in the "hook" modules specified. I will take a look.
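    
    For reference, a minimal sketch of that class-scanning alternative, assuming hooks derive from `BaseHook` (the import path and the rule of skipping re-exported classes are assumptions):
    
    ```python
    import importlib
    import inspect
    
    from airflow.hooks.base_hook import BaseHook
    
    
    def find_hook_classes(module_name: str):
        """Yield BaseHook subclasses defined in (not merely imported into) the module."""
        module = importlib.import_module(module_name)
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, BaseHook) and obj.__module__ == module_name:
                yield obj
    ```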




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532193500



##########
File path: airflow/plugins_manager.py
##########
@@ -30,6 +30,7 @@
 import importlib_metadata
 
 from airflow import settings
+from airflow.utils.entry_points_with_dist import entry_points_with_dist

Review comment:
       But I think "entry_point_with_dist" is a better name. we can use it also for non-plugins (providers_manager is technically not a plugin in the sense of "Airflow Plugin"). This is a very nice a little small utility function that can be used for anything that wants to quickly retrieve entry_points, not necessarily plugins.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735347672


   > You're missing a lot more actually:
   
    Bringing in those connections which were not defined in the original "connection.py" is a non-goal of that change. If we are planning to add all the missing connections (I certainly did not plan to), that should be a follow-up PR. I only converted those connections that had already been defined, and I left it up to each provider to define which connections should be added.
   
   I explained it in this comment yesterday @ashb: https://github.com/apache/airflow/pull/12466#discussion_r532059119
   
    I think it requires much more than just bringing those in - we need to define some standard approach for them: what should we do with the hooks that form a hierarchy (see all the amazon hooks)? How should we define connection names for those? Do we have any guidelines that you can point me to, @ashb? Maybe (if you want to bring them in) you can make a proposal for that (but I would still leave it to a follow-up PR).
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532190915



##########
File path: airflow/providers_manager.py
##########
@@ -60,38 +63,29 @@ def __new__(cls):
         return cls._instance
 
     def __init__(self):
-        # Keeps list of providers keyed by module name and value is Tuple: version, provider_info
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
         self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
         # Local source folders are loaded first. They should take precedence over the package ones for
         # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
         # So there is no risk we are going to override package provider accidentally. This can only happen
         # in case of local development
         self._discover_all_airflow_builtin_providers_from_local_sources()
         self._discover_all_providers_from_packages()
-        self._sort_provider_dictionary()
-
-    def _sort_provider_dictionary(self):
-        """
-        Sort provider_dictionary using OrderedDict.
-
-        The dictionary gets sorted so that when you iterate through it, the providers are by
-        default returned in alphabetical order.
-        """
-        sorted_dict = OrderedDict()
-        for provider_name in sorted(self._provider_dict.keys()):
-            sorted_dict[provider_name] = self._provider_dict[provider_name]
-        self._provider_dict = sorted_dict
+        self._discover_hooks()
+        self._provider_dict = OrderedDict(sorted(self.providers.items()))
+        self._hooks_dict = OrderedDict(sorted(self.providers.items()))
 
     def _discover_all_providers_from_packages(self) -> None:
         """
         Discovers all providers by scanning packages installed. The list of providers should be returned
         via the 'apache_airflow_provider' entrypoint as a dictionary conforming to the
         'airflow/provider.yaml.schema.json' schema.
         """
-        from airflow.plugins_manager import entry_points_with_dist

Review comment:
       An import in a function should never be able to cause a circular import, I thought - unless this function is also called from the top level somewhere.
    
    That's why I put the import here, not at the top level.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532081844



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {

Review comment:
       What's your proposal, @mik-laj? How should we refactor it? Also remember that it builds on previously existing fields in DbApiHook (I followed the same convention).
    
    We could use Protocol (but it's Python 3.8+: https://www.python.org/dev/peps/pep-0544/) - we already backport it from typing_extensions, and it seems like exactly what we could use. We could have each Hook also derive from DiscoverableHook (for example - I have no good name for it) and have all the hooks listed there implement it.
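    
    For illustration, such a Protocol could look like this (the attribute names follow the convention used elsewhere in this PR; `DiscoverableHook` is just the placeholder name from above):
    
    ```python
    from typing_extensions import Protocol
    
    
    class DiscoverableHook(Protocol):
        """Attributes a hook should expose to be discoverable by the providers manager."""
    
        conn_name_attr: str
        default_conn_name: str
        conn_type: str
    ```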
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735373831


   > Just one last fixup - we had cyclic dependency Connection -> provider_manager -> plugins_manager -> Hook -> Connections. Solved it by extracting entrypoint_with_dist to a separate util module (as this is an util used by both plugins_manager and providers_manager now).
   
   Did something really detect that as a cycle? Cos importing `connection.py` would not cause the provider_manager to import anything from plugins_manager -- only running code would do that, not simply importing. Therefore that is not a cyclic import.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532160271



##########
File path: airflow/providers/google/cloud/hooks/bigquery.py
##########
@@ -67,7 +67,9 @@
 class BigQueryHook(GoogleBaseHook, DbApiHook):
     """Interact with BigQuery. This hook uses the Google Cloud connection."""
 
-    conn_name_attr = 'gcp_conn_id'  # type: str
+    conn_name_attr = 'gcp_conn_id'
+    default_conn_name = 'google_cloud_default'
+    conn_type = 'google_cloud_platform'

Review comment:
       We don't need a separate type for BigQuery. In Airflow 1.10, we had a few more connection types, but in Airflow 2.0, this was standardized and most connections use one connection type. In some exceptional situations we need additional configuration options and then we use a new type of connection.
   See: http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow-providers-google/latest/connections/index.html
   https://github.com/apache/airflow/blob/master/UPDATING.md#normalize-gcp_conn_id-for-google-cloud




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532206730



##########
File path: airflow/plugins_manager.py
##########
@@ -30,6 +30,7 @@
 import importlib_metadata
 
 from airflow import settings
+from airflow.utils.entry_points_with_dist import entry_points_with_dist

Review comment:
       That's fine, I can change it this way, no problem.
   
    In general, my opinion is that there is nothing wrong with a single class/function per module, as long as there is nothing else it is really coupled with.
    
    I am not sure what benefits grouping independent functions in the same module could bring, other than accidental coupling. I prefer more files over unnecessary, accidental coupling. And you know - the design and coupling emerge over time, and we can also rename it later (refactorings and renames are super easy if you use IDEs).
    
    But I agree - giving both the module and the function the same name is bad, so if only for that reason - happy to change it.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r528965364



##########
File path: airflow/providers/apache/hive/hooks/hive.py
##########
@@ -78,9 +78,13 @@ class HiveCliHook(BaseHook):
     :type  mapred_job_name: str
     """
 
+    conn_name_attr = 'hive_cli_conn_id'
+    default_conn_name = 'hive_cli_default'
+    conn_type = 'hive_cli'
+
     def __init__(
         self,
-        hive_cli_conn_id: str = "hive_cli_default",
+        hive_cli_conn_id: str = default_conn_name,

Review comment:
       You might not have explicitly stated it, but this is creating a new Hook interface.
   
   I guess the main thing is this is opt-in, so existing sub-classes don't suddenly break. Good point. :+1: 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531484268



##########
File path: airflow/providers/microsoft/mssql/hooks/mssql.py
##########
@@ -36,7 +37,7 @@ def __init__(self, *args, **kwargs) -> None:
             (
                 "This class is deprecated and will be removed in Airflow 2.0.\n"
                 "pymssql is discontinued.  See https://github.com/pymssql/pymssql/issues/668.\n"
-                "Please use `airflow.providers.odbc.hooks.odbc.OdbcHook`"

Review comment:
       Is this deprecation just because pymssql _was_ deprecated but is no longer?
    
    If so, could we not remove this deprecation, and not have to add the new `MsSqlHookWithOdbcHook` class?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r526782670



##########
File path: airflow/providers/apache/cassandra/provider.yaml
##########
@@ -33,10 +35,15 @@ integrations:
 sensors:
   - integration-name: Apache Cassandra
     python-modules:
-      - airflow.providers.apache.cassandra.sensors.record
-      - airflow.providers.apache.cassandra.sensors.table
+      - .sensors.record
+      - .sensors.table
 
 hooks:
   - integration-name: Apache Cassandra
     python-modules:
-      - airflow.providers.apache.cassandra.hooks.cassandra
+      - .hooks.cassandra
+
+connections:
+  - connection-type: cassandra
+    hook-class: .hooks.cassandra.CassandraHook
+    connection-id-parameter-name: cassandra_conn_id

Review comment:
       I'd rather we defined these as properties in code than in a yaml file
   ```suggestion
   connections:
     - hook-class: .hooks.cassandra.CassandraHook
   ```
   
   And then:
   
   ```python
   class CassandraHook(BaseHook):
        CONNECTION_TYPE = "cassandra"
        CONNECTION_ID_PARAM_NAME = "cassandra_conn_id"
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735230536


   [The Workflow run](https://github.com/apache/airflow/actions/runs/388652304) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735375753


   > Therefore that is not a cyclic import.
   
    Of course it's not a cyclic import. Cyclic imports are detected by the python interpreter, so there is no point static-checking for them. Pylint detects cyclic dependencies between modules instead.
    
    It's a cyclic dependency between modules, not a cyclic import (just to reiterate what I keep repeating).
    
    Those are different things, and this one is easy (and natural) to solve. If two modules use a common function, it's better to extract it to a separate module rather than have one module depend on the other, when the dependency could just as well be reversed.
    
    Both providers_manager and plugins_manager are at the same "logical" level, so there is no reason why providers_manager should import anything from plugins_manager. Why not have plugins_manager import the same function from providers_manager instead?
    
    So extracting it to a common module makes perfect sense and is "cleaner" architecture - and this is what pylint complains about.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531575321



##########
File path: dev/provider_packages/SETUP_TEMPLATE.py.jinja2
##########
@@ -59,19 +59,6 @@ def do_setup(version_suffix_for_pypi=''):
         packages=find_namespace_packages(
             include=['airflow.providers.{{ PROVIDER_PACKAGE_ID }}',
                      'airflow.providers.{{ PROVIDER_PACKAGE_ID }}.*']),
-        package_data={ '': [
-{% if PROVIDER_PACKAGE_ID == 'amazon' %}
-          "*.json",
-{% elif PROVIDER_PACKAGE_ID == 'cncf.kubernetes' %}
-          "*.yaml",
-{% elif PROVIDER_PACKAGE_ID == 'google' %}
-          "*.yaml",
-          "*.sql",
-{% elif PROVIDER_PACKAGE_ID == 'papermill' %}
-          "*.ipynb",
-{% endif %}
-            ],
-        },

Review comment:
       I thought manifest was for sdist and package_data was for wheel. But cool.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531563105



##########
File path: dev/provider_packages/SETUP_TEMPLATE.py.jinja2
##########
@@ -59,19 +59,6 @@ def do_setup(version_suffix_for_pypi=''):
         packages=find_namespace_packages(
             include=['airflow.providers.{{ PROVIDER_PACKAGE_ID }}',
                      'airflow.providers.{{ PROVIDER_PACKAGE_ID }}.*']),
-        package_data={ '': [
-{% if PROVIDER_PACKAGE_ID == 'amazon' %}
-          "*.json",
-{% elif PROVIDER_PACKAGE_ID == 'cncf.kubernetes' %}
-          "*.yaml",
-{% elif PROVIDER_PACKAGE_ID == 'google' %}
-          "*.yaml",
-          "*.sql",
-{% elif PROVIDER_PACKAGE_ID == 'papermill' %}
-          "*.ipynb",
-{% endif %}
-            ],
-        },

Review comment:
       Are you sure of that? Isn't the MANIFEST mechanism enough for that? 
   
   https://github.com/apache/airflow/commit/765cbbcd76900fd0777730aaba637058b089ac95#diff-4aee44fe1bb835097b401e18b813bf5181552d25c06a701afa60f9aec4530380R19  
   
   ```
   {% if PROVIDER_PACKAGE_ID == 'amazon' %}
   include airflow/providers/amazon/aws/hooks/batch_waiters.json
   {% elif PROVIDER_PACKAGE_ID == 'cncf.kubernetes' %}
   include airflow/providers/cncf/kubernetes/example_dags/example_spark_kubernetes_spark_pi.yaml
   {% elif PROVIDER_PACKAGE_ID == 'google' %}
   include airflow/providers/google/cloud/example_dags/*.yaml
   include airflow/providers/google/cloud/example_dags/*.sql
   {% elif PROVIDER_PACKAGE_ID == 'papermill' %}
   include airflow/providers/papermill/example_dags/*.ipynb
   {% endif %}
   ```
   
   
   Or does it have to be duplicated for some reason? Is MANIFEST.in only for the sdist, and package_data here for the wheel? Do I guess correctly?
   
   It would contradict the documentation though - from https://setuptools.readthedocs.io/en/latest/userguide/datafiles.html it looks like MANIFEST.in allows specifying the includes on its own, and package_data is an optional counterpart that allows finer-grained control.
   
   > Setuptools offers three ways to specify data files to be included in your packages. First, you can simply use the include_package_data keyword, e.g.:
   
   ```
   setup(
       ...
       include_package_data=True
   )
   ```
   > This tells setuptools to install any data files it finds in your packages. The data files must be specified via the distutils’ MANIFEST.in file.
   
   > If you want finer-grained control over what files are included (for example, if you have documentation files in your package directories and want to exclude them from installation), then you can also use the package_data keyword, e.g.:
   
   Do I read that correctly - that we only need the includes in MANIFEST.in? I think I checked that the selected files in `google` and `cncf.kubernetes` were added when I removed the package_data entry. But I will double-check it again.
   
   The version with both mechanisms causes pretty confusing behaviour; it took me quite some time to discover why we have provider.yaml in the google and cncf.kubernetes packages only.
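   
   A quick sketch of how to double-check it (a wheel is just a zip archive, so its contents can simply be listed):
   
   ```
   import glob
   import zipfile
   
   # Inspect whichever wheels were just built into dist/
   for wheel_path in glob.glob('dist/*.whl'):
       with zipfile.ZipFile(wheel_path) as wheel:
           data_files = [name for name in wheel.namelist()
                         if name.endswith(('.yaml', '.sql', '.json', '.ipynb'))]
       print(wheel_path, data_files)
   ```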
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-734953283


   @ashb, looks like this one is also ready to merge and passing all the tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532192328



##########
File path: airflow/providers/google/cloud/hooks/bigquery.py
##########
@@ -67,7 +67,9 @@
 class BigQueryHook(GoogleBaseHook, DbApiHook):
     """Interact with BigQuery. This hook uses the Google Cloud connection."""
 
-    conn_name_attr = 'gcp_conn_id'  # type: str
+    conn_name_attr = 'gcp_conn_id'
+    default_conn_name = 'google_cloud_default'
+    conn_type = 'google_cloud_platform'

Review comment:
       Oh good to know.
   
   We'll have to come up with some approach to registering the information for the UI without importing the hook classes, then.
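   
   Something along these lines maybe (purely a sketch - the dict and names are made up): keep the class name as a string in the provider metadata and import the hook class lazily, only when it is actually needed:
   
   ```
   import importlib
   
   # Filled from provider metadata - no hook modules imported up front.
   HOOKS_BY_CONN_TYPE = {
       'mysql': 'airflow.providers.mysql.hooks.mysql.MySqlHook',
   }
   
   
   def get_hook_class(conn_type: str):
       """Import the hook class only when it is really needed."""
       module_name, class_name = HOOKS_BY_CONN_TYPE[conn_type].rsplit('.', maxsplit=1)
       return getattr(importlib.import_module(module_name), class_name)
   ```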




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735377509


   [The Workflow run](https://github.com/apache/airflow/actions/runs/389825585) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r529497695



##########
File path: airflow/providers/apache/hive/hooks/hive.py
##########
@@ -78,9 +78,13 @@ class HiveCliHook(BaseHook):
     :type  mapred_job_name: str
     """
 
+    conn_name_attr = 'hive_cli_conn_id'

Review comment:
   BTW. Happy to make the change later (maybe we even manage it in 2.0) if you think it's OK to break compatibility with the names. Once it is consistent, we can do a global rename.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531554294



##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,231 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates provider_directory as sorted (by package_name) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        Duplicates are removed from "package" providers in case corresponding "folder" provider is found.
+        The "folder" providers are from local sources (packages do not contain provider.yaml files),
+        so if someone has airflow installed from local sources, the providers are imported from there
+        first so, provider information should be taken from there.
+        :return:
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict
+
+    def _sort_provider_dictionary(self):
+        """
+        Sort provider_dictionary using OrderedDict.
+
+        The dictionary gets sorted so that when you iterate through it, the providers are by
+        default returned in alphabetical order.
+        """
+        sorted_dict = OrderedDict()
+        for provider_name in sorted(self._provider_dict.keys()):
+            sorted_dict[provider_name] = self._provider_dict[provider_name]
+        self._provider_dict = sorted_dict
+
+    def _discover_all_providers_from_packages(self) -> None:
+        """
+        Discovers all providers by scanning packages installed. The list of providers should be returned
+        via the 'apache_airflow_provider' entrypoint as a dictionary conforming to the
+        'airflow/provider.yaml.schema.json' schema.
+        """
+        for entry_point in pkg_resources.iter_entry_points('apache_airflow_provider'):
+            package_name = entry_point.dist.project_name
+            log.debug("Loading %s from package %s", entry_point, package_name)
+            version = entry_point.dist.version
             try:
-                imported_module = importlib.import_module(module_info.name)
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when importing %s:%s", module_info.name, e)
+                provider_info = entry_point.load()()
+            except pkg_resources.VersionConflict as e:
+                log.warning(
+                    "The provider package %s could not be registered because of version conflict : %s",
+                    package_name,
+                    e,
+                )
                 continue
-            try:
-                provider = importlib_resources.read_text(imported_module, 'provider.yaml')
-                provider_info = yaml.safe_load(provider)
-                self._validator.validate(provider_info)
-                self._provider_directory[provider_info['package-name']] = provider_info
-            except FileNotFoundError:
-                # This is OK - this is not a provider package
-                pass
-            except TypeError as e:
-                if "is not a package" not in str(e):
-                    log.warning("Error when loading 'provider.yaml' file from %s:%s}", module_info.name, e)
-                # Otherwise this is OK - this is likely a module
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when loading 'provider.yaml' file from %s:%s", module_info.name, e)
+            self._validator.validate(provider_info)
+            provider_info_package_name = provider_info['package-name']
+            if package_name != provider_info_package_name:
+                raise Exception(
+                    f"The package '{package_name}' from setuptools and "
+                    f"{provider_info_package_name} do not match. Please make sure they are aligned"
+                )
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The provider for package '%s' could not be registered from because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+
+    def _discover_all_airflow_builtin_providers_from_local_sources(self) -> None:
+        """
+        Finds all built-in airflow providers if airflow is run from the local sources.
+        It finds `provider.yaml` files for all such providers and registers the providers using those.
+
+        This 'provider.yaml' scanning takes precedence over scanning packages installed
+        in case you have both sources and packages installed, the providers will be loaded from
+        the "airflow" sources rather than from the packages.
+        """
+        try:
+            import airflow.providers
+        except ImportError:
+            log.info("You have no providers installed.")
+            return
+        try:
+            for path in airflow.providers.__path__:
+                self._add_provider_info_from_local_source_files_on_path(path)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading 'provider.yaml' files from airflow sources: %s", e)
+
+    def _add_provider_info_from_local_source_files_on_path(self, path) -> None:
+        """
+        Finds all the provider.yaml files in the directory specified.
+
+        :param path: path where to look for provider.yaml files
+        """
+        root_path = path
+        for folder, subdirs, files in os.walk(path, topdown=True):
+            for filename in fnmatch.filter(files, "provider.yaml"):
+                package_name = "apache-airflow-providers" + folder[len(root_path) :].replace(os.sep, "-")
+                self._add_provider_info_from_local_source_file(os.path.join(folder, filename), package_name)
+                subdirs[:] = []
+
+    def _add_provider_info_from_local_source_file(self, path, package_name) -> None:
+        """
+        Parses found provider.yaml file and adds found provider to the dictionary.
+
+        :param path: full file path of the provider.yaml file
+        :param package_name: name of the package
+        """
+        try:
+            log.debug("Loading %s from %s", package_name, path)
+            with open(path) as provider_yaml_file:
+                provider_info = yaml.safe_load(provider_yaml_file)
+            self._validator.validate(provider_info)
+
+            version = provider_info['versions'][0]
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The providers for package '%s' could not be registered because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading '%s': %s", path, e)
+
+    def _discover_hooks(self) -> None:
+        """Retrieves all connections defined in the providers"""
+        for name, provider in self._provider_dict.items():
+            provider_package = name
+            hook_class_names = provider[1].get("hook-class-names")
+            if hook_class_names:
+                for hook_class_name in hook_class_names:
+                    self._add_hook(hook_class_name, provider_package)
+
+    def _add_hook(self, hook_class_name, provider_package) -> None:
+        """
+        Adds hook class name to list of hooks
+
+        :param hook_class_name: name of the Hook class
+        :param provider_package: provider package adding the hook
+        """
+        if provider_package.startswith("apache-airflow"):
+            provider_path = provider_package[len("apache-") :].replace("-", ".")
+            if not hook_class_name.startswith(provider_path):
+                log.warning(
+                    "Sanity check failed when importing '%s' from '%s' package. It should start with '%s'",
+                    hook_class_name,
+                    provider_package,
+                    provider_path,
+                )
+                return
+        if hook_class_name in self._hooks_dict:
+            log.warning(
+                "The hook_class '%s' has been already registered.",
+                hook_class_name,
+            )
+            return
+        try:
+            module, class_name = hook_class_name.rsplit('.', maxsplit=1)
+            hook_class = getattr(importlib.import_module(module), class_name)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning(
+                "Exception when importing '%s' from '%s' package: %s",
+                hook_class_name,
+                provider_package,
+                e,
+            )
+            return
+        conn_type = getattr(hook_class, 'conn_type')
+        if not conn_type:
+            log.warning(
+                "The hook_class '%s' misses connection_type attribute and cannot be registered",
+                hook_class,
+            )
+            return
+        connection_if_attribute_name = getattr(hook_class, 'conn_name_attr')
+        if not connection_if_attribute_name:
+            log.warning(
+                "The hook_class '%s' misses conn_name_attr attribute and cannot be registered",
+                hook_class,
+            )
+            return
+
+        self._hooks_dict[conn_type] = (hook_class_name, connection_if_attribute_name)

Review comment:
       Happy to change it here anyway.
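   
   For reference, the sanity check in `_add_hook` above boils down to this transformation (worked example):
   
   ```
   provider_package = 'apache-airflow-providers-google'
   provider_path = provider_package[len('apache-'):].replace('-', '.')
   assert provider_path == 'airflow.providers.google'
   
   # Any hook class registered by this package must then start with that prefix:
   hook_class_name = 'airflow.providers.google.cloud.hooks.bigquery.BigQueryHook'
   assert hook_class_name.startswith(provider_path)
   ```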




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735370303


   [The Workflow run](https://github.com/apache/airflow/actions/runs/389762297) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531492454



##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,231 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates provider_directory as sorted (by package_name) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        Duplicates are removed from "package" providers in case corresponding "folder" provider is found.
+        The "folder" providers are from local sources (packages do not contain provider.yaml files),
+        so if someone has airflow installed from local sources, the providers are imported from there
+        first so, provider information should be taken from there.
+        :return:
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict
+
+    def _sort_provider_dictionary(self):
+        """
+        Sort provider_dictionary using OrderedDict.
+
+        The dictionary gets sorted so that when you iterate through it, the providers are by
+        default returned in alphabetical order.
+        """
+        sorted_dict = OrderedDict()
+        for provider_name in sorted(self._provider_dict.keys()):
+            sorted_dict[provider_name] = self._provider_dict[provider_name]
+        self._provider_dict = sorted_dict

Review comment:
       ```suggestion
           self._provider_dict = OrderedDict(sorted(self._provider_dict.items()))
   ```
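   
   (For the record, the one-liner is equivalent to the loop - `sorted()` on `dict.items()` orders the pairs by key:)
   
   ```
   from collections import OrderedDict
   
   d = {'b': 2, 'a': 1, 'c': 3}
   # Equality of OrderedDicts is order-sensitive, so this confirms the sorting.
   assert OrderedDict(sorted(d.items())) == OrderedDict([('a', 1), ('b', 2), ('c', 3)])
   ```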




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r528749701



##########
File path: airflow/providers/apache/hive/hooks/hive.py
##########
@@ -78,9 +78,13 @@ class HiveCliHook(BaseHook):
     :type  mapred_job_name: str
     """
 
+    conn_name_attr = 'hive_cli_conn_id'

Review comment:
   Just following "do the same as others": we already had other connection class attributes lowercase (a convention used in the SQL hooks). So I just followed it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r527272497



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -17,6 +17,11 @@
         "type": "string"
       }
     },
+    "provider-package": {

Review comment:
       Thanks! 
   
    TL;DR: I think the entrypoint idea for runtime optimization is really cool; happy to incorporate it as long as we keep the "developer" experience intact (or even optimize it in this case as well).
   
   > Yes, I firmly believe this should be the case. By not having third party packages live under the airflow. namespace, it becomes much much easier for Us and hopefully users to know what is an official provider, and thus supported by us vs not.
   
   1. Do you think we should prevent 3rd parties from using airflow.*, or enforce that somehow? I do not think there is a way of doing it unless we cryptographically sign at least some of the packages with our own key. Or is it just a convention you propose here - that other packages SHOULD NOT use "airflow" as the package name, by convention?
   
   > That way any package can register a provider, by adding something like this to it's setup.cfg:
   
   [options.entry_points]
   apache-airflow-provider=
       x=my_company.provider.x:register_provider
   
   That's a cool idea, especially since rather than returning a hard-coded dictionary, we can return the very same `provider.yaml` file as a dictionary. That will make it super-easy for anyone writing their own provider, as they will have the JSON schema that defines the expected structure of the file, and it will be very easy to explain. It is very standard, supported by most IDEs with auto-completion and documentation (I used it to add the current files), and it can be used to verify such provider information during provider development.
   
   This might help to speed up runtime discovery - one of the problems with `walk_packages` is that it is a bit slow, so not having to use it at runtime is good. And I am happy to change the runtime mechanism to that, as long as (as discussed on Slack) we solve the problem of local provider development. We still need to support testing and developing airflow providers directly from checked-out airflow sources. Having to install each provider as a package in order to test it sounds like a terrible idea for the people who develop providers (and they are most of the casual contributors to airflow).
   
   If we can switch to entry_points at runtime, that also makes it possible to optimize the directly-from-sources case. That case is then much easier, because we can rely on the packages being in the right folders and use simple directory walking to find all the provider files, which makes it considerably faster. Paired with the runtime "entrypoint" discovery, I think we can have a very optimized solution that keeps all the nice properties we have now: provider "meta-data" in one place, in a schema-validated yaml file, for the "airflow" providers. But if someone wants, they can still keep it as dict objects or JSON in 3rd-party providers, as long as they follow the schema. I don't think we need to force anyone to use yaml for that, even if it is super-convenient to us and we also use it for documentation generation.
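   
   To illustrate (just a sketch - the function name, and returning the yaml file itself, are examples rather than a decided convention), the entrypoint target could simply load the packaged provider.yaml and return it as a dictionary:
   
   ```
   import yaml
   
   try:
       import importlib.resources as importlib_resources  # Python 3.7+
   except ImportError:
       import importlib_resources  # backport package
   
   
   def get_provider_info():
       provider_yaml = importlib_resources.read_text(
           'airflow.providers.google', 'provider.yaml'
       )
       return yaml.safe_load(provider_yaml)
   ```
   
   With the entrypoint pointing at such a function, runtime discovery would need no `walk_packages` at all.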
   
   
   Unless you have another, better idea to keep the developer experience intact.
   
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-732396554


   [The Workflow run](https://github.com/apache/airflow/actions/runs/379677777) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r526704452



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -17,6 +17,11 @@
         "type": "string"
       }
     },
+    "provider-package": {

Review comment:
   I will also split this into two PRs - one adding "provider-package" and one adding connections - to keep them separate.
   
   Also, after looking at how fast and easy it is to extract all the "provider" features (extra links, connections etc.) and keep them in the provider.yaml, I think maybe we should even bring the discussion about conventions for 3rd-party providers to the devlist? This was a non-goal for AIP-8 and Airflow 2.0, but it seems we are SOOOO close to it that it's worth doing the extra effort to discuss and vote on the 3rd-party provider conventions, showing examples from our existing providers.
   
   The nice thing about it is that we can provide some examples and clear instructions on "how to add a new 3rd-party provider" and even a way to generate really nice documentation with examples for those.
   
   Seems that we will pretty much have what we need for that very soon; it just needs agreement on the devlist, and we still have time to discuss it before 2.0.
   
   WDYT?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r526742380



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -17,6 +17,11 @@
         "type": "string"
       }
     },
+    "provider-package": {

Review comment:
   As discussed with @mik-laj -> I will go back to absolute imports and add a check that verifies they match the provider-package. It will be easier to maintain.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531541704



##########
File path: airflow/providers/microsoft/mssql/hooks/mssql.py
##########
@@ -70,3 +71,9 @@ def set_autocommit(
 
     def get_autocommit(self, conn: pymssql.connect):  # pylint: disable=c-extension-no-member
         return conn.autocommit_state
+
+
+class MsSqlHookWithOdbcHook(OdbcHook):
+    """Interact with Microsoft SQL Server via ODBC."""
+
+    conn_type = 'mssql'

Review comment:
   I think just `mssql_default` is needed. Since it derives from the OdbcHook, it also inherits those class-level variables. But indeed, if we want to keep consistency, `default_conn_name` should be `mssql_default`. The `conn_name_attr` should still be `odbc_conn_id`, not `mssql_conn_id`, because that is the name of the parameter initialized in the OdbcHook.
   
   But yeah - adding them here explicitly rather than relying on the inherited values is a good idea.
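   
   So, spelled out explicitly, something like this sketch:
   
   ```
   from airflow.providers.odbc.hooks.odbc import OdbcHook
   
   
   class MsSqlHookWithOdbcHook(OdbcHook):
       """Interact with Microsoft SQL Server via ODBC."""
   
       conn_type = 'mssql'
       default_conn_name = 'mssql_default'
       # Repeated here only for readability - it is inherited from OdbcHook and must
       # stay 'odbc_conn_id', since that is the constructor parameter name there.
       conn_name_attr = 'odbc_conn_id'
   ```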




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531450044



##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,232 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates provider_directory as sorted (by package_name) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        Duplicates are removed from "package" providers in case corresponding "folder" provider is found.
+        The "folder" providers are from local sources (packages do not contain provider.yaml files),
+        so if someone has airflow installed from local sources, the providers are imported from there
+        first so, provider information should be taken from there.
+        :return:
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict
+
+    def _sort_provider_dictionary(self):
+        """
+        Sort provider_dictionary using OrderedDict.
+
+        The dictionary gets sorted so that when you iterate through it, the providers are by
+        default returned in alphabetical order.
+        """
+        sorted_dict = OrderedDict()
+        for provider_name in sorted(self._provider_dict.keys()):
+            sorted_dict[provider_name] = self._provider_dict[provider_name]
+        self._provider_dict = sorted_dict
+
+    def _discover_all_providers_from_packages(self) -> None:
+        """
+        Discovers all providers by scanning packages installed. The list of providers should be returned
+        via the 'apache_airflow_provider' entrypoint as a dictionary conforming to the
+        'airflow/provider.yaml.schema.json' schema.
+        """
+        for entry_point in pkg_resources.iter_entry_points('apache_airflow_provider'):
+            package_name = entry_point.dist.project_name
+            log.debug("Loading %s from package %s", entry_point, package_name)
+            version = entry_point.dist.version
             try:
-                imported_module = importlib.import_module(module_info.name)
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when importing %s:%s", module_info.name, e)
+                provider_info = entry_point.load()()
+            except pkg_resources.VersionConflict as e:
+                log.warning(
+                    "The provider package %s could not be registered because of version conflict : %s",
+                    package_name,
+                    e,
+                )
                 continue
-            try:
-                provider = importlib_resources.read_text(imported_module, 'provider.yaml')
-                provider_info = yaml.safe_load(provider)
-                self._validator.validate(provider_info)
-                self._provider_directory[provider_info['package-name']] = provider_info
-            except FileNotFoundError:
-                # This is OK - this is not a provider package
-                pass
-            except TypeError as e:
-                if "is not a package" not in str(e):
-                    log.warning("Error when loading 'provider.yaml' file from %s:%s}", module_info.name, e)
-                # Otherwise this is OK - this is likely a module
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when loading 'provider.yaml' file from %s:%s", module_info.name, e)
+            self._validator.validate(provider_info)
+            provider_info_package_name = provider_info['package-name']
+            if package_name != provider_info_package_name:
+                raise Exception(
+                    f"The package '{package_name}' from setuptools and "
+                    f"{provider_info_package_name} do not match. Please make sure they are"
+                    f"aligned"

Review comment:
       Joined the two lines.
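   
   (The issue with the split version was implicit string concatenation - adjacent f-string literals are joined with no space in between; names below are just examples:)
   
   ```
   name_a = 'foo'
   name_b = 'bar'
   msg = (
       f"The package '{name_a}' from setuptools and "
       f"{name_b} do not match. Please make sure they are"
       f"aligned"
   )
   assert msg.endswith('they arealigned')  # the missing space - hence joining the lines
   ```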




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735377178


   [The Workflow run](https://github.com/apache/airflow/actions/runs/389822268) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531547584



##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,231 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates provider_directory as sorted (by package_name) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        Duplicates are removed from "package" providers in case corresponding "folder" provider is found.
+        The "folder" providers are from local sources (packages do not contain provider.yaml files),
+        so if someone has airflow installed from local sources, the providers are imported from there
+        first so, provider information should be taken from there.
+        :return:
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict

Review comment:
       Yep.

##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,231 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates provider_directory as sorted (by package_name) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        Duplicates are removed from "package" providers in case corresponding "folder" provider is found.
+        The "folder" providers are from local sources (packages do not contain provider.yaml files),
+        so if someone has airflow installed from local sources, the providers are imported from there
+        first so, provider information should be taken from there.
+        :return:
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict
+
+    def _sort_provider_dictionary(self):
+        """
+        Sort provider_dictionary using OrderedDict.
+
+        The dictionary gets sorted so that when you iterate through it, the providers are by
+        default returned in alphabetical order.
+        """
+        sorted_dict = OrderedDict()
+        for provider_name in sorted(self._provider_dict.keys()):
+            sorted_dict[provider_name] = self._provider_dict[provider_name]
+        self._provider_dict = sorted_dict

Review comment:
       yep




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-734806825


   All fixed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-730103510


   [The Workflow run](https://github.com/apache/airflow/actions/runs/371491579) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532059119



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {

Review comment:
       I added a simple check now, and I think we should stay with explicitly listing the hook classes.
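   
   (A sketch of roughly what such a check can look like - simplified and with assumed details, not necessarily the exact code:)
   
   ```
   import importlib
   
   
   def validate_hook_class_names(hook_class_names):
       """Check that each explicitly listed hook class defines the attributes
       the core needs to register it as a connection type."""
       for hook_class_name in hook_class_names:
           module_name, class_name = hook_class_name.rsplit('.', maxsplit=1)
           hook_class = getattr(importlib.import_module(module_name), class_name)
           for attr in ('conn_type', 'conn_name_attr'):
               if not getattr(hook_class, attr, None):
                   raise ValueError(f"{hook_class_name} is missing the '{attr}' attribute")
   ```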
   
   Just checking one provider - amazon. Below is the list of Amazon's hooks: none of them was ever "registered" in the code as a "connection type to hook" mapping, nor with a customized ConnectionForm, nor in `connection_form.js`. Similarly, answering the https://github.com/apache/airflow/pull/12466#pullrequestreview-536502242 question after being a little better informed: I do not think in this case we should require all hooks to have those extra class fields (at least not now). The use of those fields for hooks that do not register themselves in the core is unclear, and I think we should not require those hooks to have such values.
   
   Maybe at some point we will want to extend the API of the BaseHook object and make the fields obligatory for all hooks, but I am not sure it is something we should do now.
   
   IMHO, each provider should be free to choose which of its hooks get "registered" by the core for use in the UI and in the "connection type to hook" mapping. @ashb @mik-laj - WDYT?
   
   Here are the Amazon hooks which do not require any special "registering" in the core (and never did):
   
   ```
   'airflow.providers.amazon.aws.hooks.athena.AWSAthenaHook'
   'airflow.providers.amazon.aws.hooks.athena.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.aws_dynamodb.AwsDynamoDBHook'
   'airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.base_aws.BaseHook'
   'airflow.providers.amazon.aws.hooks.batch_client.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.batch_client.AwsBatchClientHook'
   'airflow.providers.amazon.aws.hooks.batch_waiters.AwsBatchClientHook'
   'airflow.providers.amazon.aws.hooks.batch_waiters.AwsBatchWaitersHook'
   'airflow.providers.amazon.aws.hooks.cloud_formation.AWSCloudFormationHook'
   'airflow.providers.amazon.aws.hooks.cloud_formation.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.datasync.AWSDataSyncHook'
   'airflow.providers.amazon.aws.hooks.datasync.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.dynamodb.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.dynamodb.AwsDynamoDBHook'
   'airflow.providers.amazon.aws.hooks.ec2.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.ec2.EC2Hook'
   'airflow.providers.amazon.aws.hooks.elasticache_replication_group.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.elasticache_replication_group.ElastiCacheReplicationGroupHook'
   'airflow.providers.amazon.aws.hooks.emr.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.emr.EmrHook'
   'airflow.providers.amazon.aws.hooks.glacier.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.glacier.GlacierHook'
   'airflow.providers.amazon.aws.hooks.glue.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.glue.AwsGlueJobHook'
   'airflow.providers.amazon.aws.hooks.glue_catalog.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.glue_catalog.AwsGlueCatalogHook'
   'airflow.providers.amazon.aws.hooks.kinesis.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.kinesis.AwsFirehoseHook'
   'airflow.providers.amazon.aws.hooks.lambda_function.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.lambda_function.AwsLambdaHook'
   'airflow.providers.amazon.aws.hooks.logs.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.logs.AwsLogsHook'
   'airflow.providers.amazon.aws.hooks.redshift.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.redshift.RedshiftHook'
   'airflow.providers.amazon.aws.hooks.s3.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.s3.S3Hook'
   'airflow.providers.amazon.aws.hooks.sagemaker.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.sagemaker.AwsLogsHook'
   'airflow.providers.amazon.aws.hooks.sagemaker.S3Hook'
   'airflow.providers.amazon.aws.hooks.sagemaker.SageMakerHook'
   'airflow.providers.amazon.aws.hooks.secrets_manager.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.secrets_manager.SecretsManagerHook'
   'airflow.providers.amazon.aws.hooks.ses.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.ses.SESHook'
   'airflow.providers.amazon.aws.hooks.sns.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.sns.AwsSnsHook'
   'airflow.providers.amazon.aws.hooks.sqs.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.sqs.SQSHook'
   'airflow.providers.amazon.aws.hooks.step_function.AwsBaseHook'
   'airflow.providers.amazon.aws.hooks.step_function.StepFunctionHook'
   ```
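   
   To make the opt-in concrete, here is a minimal sketch of a Hook that *does* register itself (the names and connection type are illustrative, not from this PR; the `BaseHook` import path may differ between Airflow versions). A provider would list the class's dotted path under `hook-class-names`, and the `ProvidersManager` reads these class attributes:
   
   ```python
   from airflow.hooks.base import BaseHook  # import path may vary by Airflow version
   
   
   class MyServiceHook(BaseHook):
       conn_name_attr = 'my_service_conn_id'     # ctor kwarg that carries the connection id
       default_conn_name = 'my_service_default'  # fallback connection id
       conn_type = 'my_service'                  # key in the conn-type -> hook mapping
   
       def __init__(self, my_service_conn_id: str = default_conn_name) -> None:
           super().__init__()
           self.my_service_conn_id = my_service_conn_id
   ```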
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-733903724


   [The Workflow run](https://github.com/apache/airflow/actions/runs/383790416) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r527276595



##########
File path: airflow/providers_manager.py
##########
@@ -46,42 +45,134 @@ class ProvidersManager:
     """Manages all provider packages."""
 
     def __init__(self):
+        self.__log = logging.getLogger(__name__)

Review comment:
       Privacy by design. I switched to the (recommended by many) style of programming: only expose what you expect to be used, and keep everything else private by design. That makes the class much better encapsulated and makes it easier to avoid accidental uses and unintended side effects.
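   
   A minimal sketch of the name-mangling effect this relies on: a double-underscore attribute is rewritten to `_ClassName__attr`, so outside code cannot reach it accidentally (class and attribute names below are illustrative):
   
   ```python
   class Demo:
       def __init__(self):
           self.__log = "internal"  # stored as _Demo__log
   
       def describe(self):
           return self.__log  # accessible as usual inside the class
   
   
   demo = Demo()
   print(demo.describe())  # prints "internal"
   # print(demo.__log)     # AttributeError: 'Demo' object has no attribute '__log'
   ```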




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531554294



##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,231 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates provider_directory as sorted (by package_name) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        Duplicates are removed from "package" providers in case corresponding "folder" provider is found.
+        The "folder" providers are from local sources (packages do not contain provider.yaml files),
+        so if someone has airflow installed from local sources, the providers are imported from there
+        first so, provider information should be taken from there.
+        :return:
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict
+
+    def _sort_provider_dictionary(self):
+        """
+        Sort provider_dictionary using OrderedDict.
+
+        The dictionary gets sorted so that when you iterate through it, the providers are by
+        default returned in alphabetical order.
+        """
+        sorted_dict = OrderedDict()
+        for provider_name in sorted(self._provider_dict.keys()):
+            sorted_dict[provider_name] = self._provider_dict[provider_name]
+        self._provider_dict = sorted_dict
+
+    def _discover_all_providers_from_packages(self) -> None:
+        """
+        Discovers all providers by scanning packages installed. The list of providers should be returned
+        via the 'apache_airflow_provider' entrypoint as a dictionary conforming to the
+        'airflow/provider.yaml.schema.json' schema.
+        """
+        for entry_point in pkg_resources.iter_entry_points('apache_airflow_provider'):
+            package_name = entry_point.dist.project_name
+            log.debug("Loading %s from package %s", entry_point, package_name)
+            version = entry_point.dist.version
             try:
-                imported_module = importlib.import_module(module_info.name)
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when importing %s:%s", module_info.name, e)
+                provider_info = entry_point.load()()
+            except pkg_resources.VersionConflict as e:
+                log.warning(
+                    "The provider package %s could not be registered because of version conflict : %s",
+                    package_name,
+                    e,
+                )
                 continue
-            try:
-                provider = importlib_resources.read_text(imported_module, 'provider.yaml')
-                provider_info = yaml.safe_load(provider)
-                self._validator.validate(provider_info)
-                self._provider_directory[provider_info['package-name']] = provider_info
-            except FileNotFoundError:
-                # This is OK - this is not a provider package
-                pass
-            except TypeError as e:
-                if "is not a package" not in str(e):
-                    log.warning("Error when loading 'provider.yaml' file from %s:%s}", module_info.name, e)
-                # Otherwise this is OK - this is likely a module
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when loading 'provider.yaml' file from %s:%s", module_info.name, e)
+            self._validator.validate(provider_info)
+            provider_info_package_name = provider_info['package-name']
+            if package_name != provider_info_package_name:
+                raise Exception(
+                    f"The package '{package_name}' from setuptools and "
+                    f"{provider_info_package_name} do not match. Please make sure they are aligned"
+                )
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The provider for package '%s' could not be registered from because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+
+    def _discover_all_airflow_builtin_providers_from_local_sources(self) -> None:
+        """
+        Finds all built-in airflow providers if airflow is run from the local sources.
+        It finds `provider.yaml` files for all such providers and registers the providers using those.
+
+        This 'provider.yaml' scanning takes precedence over scanning packages installed
+        in case you have both sources and packages installed, the providers will be loaded from
+        the "airflow" sources rather than from the packages.
+        """
+        try:
+            import airflow.providers
+        except ImportError:
+            log.info("You have no providers installed.")
+            return
+        try:
+            for path in airflow.providers.__path__:
+                self._add_provider_info_from_local_source_files_on_path(path)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading 'provider.yaml' files from airflow sources: %s", e)
+
+    def _add_provider_info_from_local_source_files_on_path(self, path) -> None:
+        """
+        Finds all the provider.yaml files in the directory specified.
+
+        :param path: path where to look for provider.yaml files
+        """
+        root_path = path
+        for folder, subdirs, files in os.walk(path, topdown=True):
+            for filename in fnmatch.filter(files, "provider.yaml"):
+                package_name = "apache-airflow-providers" + folder[len(root_path) :].replace(os.sep, "-")
+                self._add_provider_info_from_local_source_file(os.path.join(folder, filename), package_name)
+                subdirs[:] = []
+
+    def _add_provider_info_from_local_source_file(self, path, package_name) -> None:
+        """
+        Parses found provider.yaml file and adds found provider to the dictionary.
+
+        :param path: full file path of the provider.yaml file
+        :param package_name: name of the package
+        """
+        try:
+            log.debug("Loading %s from %s", package_name, path)
+            with open(path) as provider_yaml_file:
+                provider_info = yaml.safe_load(provider_yaml_file)
+            self._validator.validate(provider_info)
+
+            version = provider_info['versions'][0]
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The providers for package '%s' could not be registered because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading '%s': %s", path, e)
+
+    def _discover_hooks(self) -> None:
+        """Retrieves all connections defined in the providers"""
+        for name, provider in self._provider_dict.items():
+            provider_package = name
+            hook_class_names = provider[1].get("hook-class-names")
+            if hook_class_names:
+                for hook_class_name in hook_class_names:
+                    self._add_hook(hook_class_name, provider_package)
+
+    def _add_hook(self, hook_class_name, provider_package) -> None:
+        """
+        Adds hook class name to list of hooks
+
+        :param hook_class_name: name of the Hook class
+        :param provider_package: provider package adding the hook
+        """
+        if provider_package.startswith("apache-airflow"):
+            provider_path = provider_package[len("apache-") :].replace("-", ".")
+            if not hook_class_name.startswith(provider_path):
+                log.warning(
+                    "Sanity check failed when importing '%s' from '%s' package. It should start with '%s'",
+                    hook_class_name,
+                    provider_package,
+                    provider_path,
+                )
+                return
+        if hook_class_name in self._hooks_dict:
+            log.warning(
+                "The hook_class '%s' has been already registered.",
+                hook_class_name,
+            )
+            return
+        try:
+            module, class_name = hook_class_name.rsplit('.', maxsplit=1)
+            hook_class = getattr(importlib.import_module(module), class_name)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning(
+                "Exception when importing '%s' from '%s' package: %s",
+                hook_class_name,
+                provider_package,
+                e,
+            )
+            return
+        conn_type = getattr(hook_class, 'conn_type')
+        if not conn_type:
+            log.warning(
+                "The hook_class '%s' misses connection_type attribute and cannot be registered",
+                hook_class,
+            )
+            return
+        connection_if_attribute_name = getattr(hook_class, 'conn_name_attr')
+        if not connection_if_attribute_name:
+            log.warning(
+                "The hook_class '%s' misses conn_name_attr attribute and cannot be registered",
+                hook_class,
+            )
+            return
+
+        self._hooks_dict[conn_type] = (hook_class_name, connection_if_attribute_name)

Review comment:
       Happy to change it here anyway.
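   
   As an aside, the diff above makes `ProvidersManager` a singleton via `__new__`; a minimal sketch of that pattern (trimmed to the pattern itself, class name illustrative):
   
   ```python
   class Singleton:
       _instance = None
   
       def __new__(cls):
           if cls._instance is None:
               cls._instance = super().__new__(cls)
           return cls._instance
   
   
   assert Singleton() is Singleton()  # every call returns the same instance
   # Note: __init__ still runs on each call; only the instance object is shared.
   ```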




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531509082



##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,231 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates provider_directory as sorted (by package_name) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        Duplicates are removed from "package" providers in case corresponding "folder" provider is found.
+        The "folder" providers are from local sources (packages do not contain provider.yaml files),
+        so if someone has airflow installed from local sources, the providers are imported from there
+        first so, provider information should be taken from there.
+        :return:
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict
+
+    def _sort_provider_dictionary(self):
+        """
+        Sort provider_dictionary using OrderedDict.
+
+        The dictionary gets sorted so that when you iterate through it, the providers are by
+        default returned in alphabetical order.
+        """
+        sorted_dict = OrderedDict()
+        for provider_name in sorted(self._provider_dict.keys()):
+            sorted_dict[provider_name] = self._provider_dict[provider_name]
+        self._provider_dict = sorted_dict
+
+    def _discover_all_providers_from_packages(self) -> None:
+        """
+        Discovers all providers by scanning packages installed. The list of providers should be returned
+        via the 'apache_airflow_provider' entrypoint as a dictionary conforming to the
+        'airflow/provider.yaml.schema.json' schema.
+        """
+        for entry_point in pkg_resources.iter_entry_points('apache_airflow_provider'):
+            package_name = entry_point.dist.project_name
+            log.debug("Loading %s from package %s", entry_point, package_name)
+            version = entry_point.dist.version
             try:
-                imported_module = importlib.import_module(module_info.name)
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when importing %s:%s", module_info.name, e)
+                provider_info = entry_point.load()()
+            except pkg_resources.VersionConflict as e:
+                log.warning(
+                    "The provider package %s could not be registered because of version conflict : %s",
+                    package_name,
+                    e,
+                )
                 continue
-            try:
-                provider = importlib_resources.read_text(imported_module, 'provider.yaml')
-                provider_info = yaml.safe_load(provider)
-                self._validator.validate(provider_info)
-                self._provider_directory[provider_info['package-name']] = provider_info
-            except FileNotFoundError:
-                # This is OK - this is not a provider package
-                pass
-            except TypeError as e:
-                if "is not a package" not in str(e):
-                    log.warning("Error when loading 'provider.yaml' file from %s:%s}", module_info.name, e)
-                # Otherwise this is OK - this is likely a module
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when loading 'provider.yaml' file from %s:%s", module_info.name, e)
+            self._validator.validate(provider_info)
+            provider_info_package_name = provider_info['package-name']
+            if package_name != provider_info_package_name:
+                raise Exception(
+                    f"The package '{package_name}' from setuptools and "
+                    f"{provider_info_package_name} do not match. Please make sure they are aligned"
+                )
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The provider for package '%s' could not be registered from because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+
+    def _discover_all_airflow_builtin_providers_from_local_sources(self) -> None:
+        """
+        Finds all built-in airflow providers if airflow is run from the local sources.
+        It finds `provider.yaml` files for all such providers and registers the providers using those.
+
+        This 'provider.yaml' scanning takes precedence over scanning packages installed
+        in case you have both sources and packages installed, the providers will be loaded from
+        the "airflow" sources rather than from the packages.
+        """
+        try:
+            import airflow.providers
+        except ImportError:
+            log.info("You have no providers installed.")
+            return
+        try:
+            for path in airflow.providers.__path__:
+                self._add_provider_info_from_local_source_files_on_path(path)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading 'provider.yaml' files from airflow sources: %s", e)
+
+    def _add_provider_info_from_local_source_files_on_path(self, path) -> None:
+        """
+        Finds all the provider.yaml files in the directory specified.
+
+        :param path: path where to look for provider.yaml files
+        """
+        root_path = path
+        for folder, subdirs, files in os.walk(path, topdown=True):
+            for filename in fnmatch.filter(files, "provider.yaml"):
+                package_name = "apache-airflow-providers" + folder[len(root_path) :].replace(os.sep, "-")
+                self._add_provider_info_from_local_source_file(os.path.join(folder, filename), package_name)
+                subdirs[:] = []
+
+    def _add_provider_info_from_local_source_file(self, path, package_name) -> None:
+        """
+        Parses found provider.yaml file and adds found provider to the dictionary.
+
+        :param path: full file path of the provider.yaml file
+        :param package_name: name of the package
+        """
+        try:
+            log.debug("Loading %s from %s", package_name, path)
+            with open(path) as provider_yaml_file:
+                provider_info = yaml.safe_load(provider_yaml_file)
+            self._validator.validate(provider_info)
+
+            version = provider_info['versions'][0]
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The providers for package '%s' could not be registered because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading '%s': %s", path, e)
+
+    def _discover_hooks(self) -> None:
+        """Retrieves all connections defined in the providers"""
+        for name, provider in self._provider_dict.items():
+            provider_package = name
+            hook_class_names = provider[1].get("hook-class-names")
+            if hook_class_names:
+                for hook_class_name in hook_class_names:
+                    self._add_hook(hook_class_name, provider_package)
+
+    def _add_hook(self, hook_class_name, provider_package) -> None:
+        """
+        Adds hook class name to list of hooks
+
+        :param hook_class_name: name of the Hook class
+        :param provider_package: provider package adding the hook
+        """
+        if provider_package.startswith("apache-airflow"):
+            provider_path = provider_package[len("apache-") :].replace("-", ".")
+            if not hook_class_name.startswith(provider_path):
+                log.warning(
+                    "Sanity check failed when importing '%s' from '%s' package. It should start with '%s'",
+                    hook_class_name,
+                    provider_package,
+                    provider_path,
+                )
+                return
+        if hook_class_name in self._hooks_dict:
+            log.warning(
+                "The hook_class '%s' has been already registered.",
+                hook_class_name,
+            )
+            return
+        try:
+            module, class_name = hook_class_name.rsplit('.', maxsplit=1)
+            hook_class = getattr(importlib.import_module(module), class_name)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning(
+                "Exception when importing '%s' from '%s' package: %s",
+                hook_class_name,
+                provider_package,
+                e,
+            )
+            return
+        conn_type = getattr(hook_class, 'conn_type')
+        if not conn_type:
+            log.warning(
+                "The hook_class '%s' misses connection_type attribute and cannot be registered",
+                hook_class,
+            )
+            return
+        connection_if_attribute_name = getattr(hook_class, 'conn_name_attr')
+        if not connection_if_attribute_name:
+            log.warning(
+                "The hook_class '%s' misses conn_name_attr attribute and cannot be registered",
+                hook_class,
+            )
+            return
+
+        self._hooks_dict[conn_type] = (hook_class_name, connection_if_attribute_name)

Review comment:
       Oh, wrong PR for this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r528965905



##########
File path: airflow/providers/apache/hive/hooks/hive.py
##########
@@ -78,9 +78,13 @@ class HiveCliHook(BaseHook):
     :type  mapred_job_name: str
     """
 
+    conn_name_attr = 'hive_cli_conn_id'

Review comment:
       I'm not sure the convention from SQL hooks is the best one -- those are some of the oldest parts of Airflow and not very "Good Practice".
   
   I'm okay with it if you think this way is better, but in many other places in Airflow we use the UPPERCASE style.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531449484



##########
File path: airflow/models/connection.py
##########
@@ -30,69 +31,22 @@
 from airflow.exceptions import AirflowException, AirflowNotFoundException
 from airflow.models.base import ID_LEN, Base
 from airflow.models.crypto import get_fernet
+from airflow.providers_manager import ProvidersManager
 from airflow.utils.log.logging_mixin import LoggingMixin
 from airflow.utils.module_loading import import_string
 
-# A map that assigns a connection type to a tuple that contains
-# the path of the class and the name of the conn_id key parameter.
-# PLEASE KEEP BELOW LIST IN ALPHABETICAL ORDER.
-CONN_TYPE_TO_HOOK = {
-    "azure_batch": (
-        "airflow.providers.microsoft.azure.hooks.azure_batch.AzureBatchHook",
-        "azure_batch_conn_id",
-    ),
-    "azure_cosmos": (
-        "airflow.providers.microsoft.azure.hooks.azure_cosmos.AzureCosmosDBHook",
-        "azure_cosmos_conn_id",
-    ),
-    "azure_data_lake": (
-        "airflow.providers.microsoft.azure.hooks.azure_data_lake.AzureDataLakeHook",
-        "azure_data_lake_conn_id",
-    ),
-    "cassandra": ("airflow.providers.apache.cassandra.hooks.cassandra.CassandraHook", "cassandra_conn_id"),
-    "cloudant": ("airflow.providers.cloudant.hooks.cloudant.CloudantHook", "cloudant_conn_id"),
-    "dataprep": ("airflow.providers.google.cloud.hooks.dataprep.GoogleDataprepHook", "dataprep_default"),
-    "docker": ("airflow.providers.docker.hooks.docker.DockerHook", "docker_conn_id"),
-    "elasticsearch": (
-        "airflow.providers.elasticsearch.hooks.elasticsearch.ElasticsearchHook",
-        "elasticsearch_conn_id",
-    ),
-    "exasol": ("airflow.providers.exasol.hooks.exasol.ExasolHook", "exasol_conn_id"),
-    "gcpcloudsql": (
-        "airflow.providers.google.cloud.hooks.cloud_sql.CloudSQLDatabaseHook",
-        "gcp_cloudsql_conn_id",
-    ),
-    "gcpssh": (
-        "airflow.providers.google.cloud.hooks.compute_ssh.ComputeEngineSSHHook",
-        "gcp_conn_id",
-    ),
-    "google_cloud_platform": (
-        "airflow.providers.google.cloud.hooks.bigquery.BigQueryHook",
-        "bigquery_conn_id",
-    ),
-    "grpc": ("airflow.providers.grpc.hooks.grpc.GrpcHook", "grpc_conn_id"),
-    "hive_cli": ("airflow.providers.apache.hive.hooks.hive.HiveCliHook", "hive_cli_conn_id"),
-    "hiveserver2": ("airflow.providers.apache.hive.hooks.hive.HiveServer2Hook", "hiveserver2_conn_id"),
-    "imap": ("airflow.providers.imap.hooks.imap.ImapHook", "imap_conn_id"),
-    "jdbc": ("airflow.providers.jdbc.hooks.jdbc.JdbcHook", "jdbc_conn_id"),
-    "jira": ("airflow.providers.jira.hooks.jira.JiraHook", "jira_conn_id"),
-    "kubernetes": ("airflow.providers.cncf.kubernetes.hooks.kubernetes.KubernetesHook", "kubernetes_conn_id"),
-    "mongo": ("airflow.providers.mongo.hooks.mongo.MongoHook", "conn_id"),
-    "mssql": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "mysql": ("airflow.providers.mysql.hooks.mysql.MySqlHook", "mysql_conn_id"),
-    "odbc": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "oracle": ("airflow.providers.oracle.hooks.oracle.OracleHook", "oracle_conn_id"),
-    "pig_cli": ("airflow.providers.apache.pig.hooks.pig.PigCliHook", "pig_cli_conn_id"),
-    "postgres": ("airflow.providers.postgres.hooks.postgres.PostgresHook", "postgres_conn_id"),
-    "presto": ("airflow.providers.presto.hooks.presto.PrestoHook", "presto_conn_id"),
-    "redis": ("airflow.providers.redis.hooks.redis.RedisHook", "redis_conn_id"),
-    "snowflake": ("airflow.providers.snowflake.hooks.snowflake.SnowflakeHook", "snowflake_conn_id"),
-    "sqlite": ("airflow.providers.sqlite.hooks.sqlite.SqliteHook", "sqlite_conn_id"),
-    "tableau": ("airflow.providers.salesforce.hooks.tableau.TableauHook", "tableau_conn_id"),
-    "vertica": ("airflow.providers.vertica.hooks.vertica.VerticaHook", "vertica_conn_id"),
-    "wasb": ("airflow.providers.microsoft.azure.hooks.wasb.WasbHook", "wasb_conn_id"),
-}
-# PLEASE KEEP ABOVE LIST IN ALPHABETICAL ORDER.
+cache = lru_cache(maxsize=None)
+
+
+@cache

Review comment:
       Yep
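   
   For context on the `cache = lru_cache(maxsize=None)` line in the diff: `functools.cache` only exists on Python 3.9+, so `lru_cache(maxsize=None)` provides the same unbounded memoization on older interpreters. A hedged sketch (function name and values are illustrative):
   
   ```python
   from functools import lru_cache
   
   cache = lru_cache(maxsize=None)  # equivalent to functools.cache on 3.9+
   
   
   @cache
   def _conn_type_to_hook():
       print("computed once")  # printed only on the first call
       return {"sqlite": ("airflow.providers.sqlite.hooks.sqlite.SqliteHook", "sqlite_conn_id")}
   
   
   _conn_type_to_hook()  # builds and caches the mapping
   _conn_type_to_hook()  # served from the cache, no print
   ```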




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532194601



##########
File path: airflow/plugins_manager.py
##########
@@ -30,6 +30,7 @@
 import importlib_metadata
 
 from airflow import settings
+from airflow.utils.entry_points_with_dist import entry_points_with_dist

Review comment:
       My problem with calling the module `entry_points_with_dist`  is that it makes it a single use module -- we'd almost never be able to put anything else in here.
   
   How about `airflow.utils.entry_points` then?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531490635



##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,231 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates provider_directory as sorted (by package_name) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        Duplicates are removed from "package" providers in case corresponding "folder" provider is found.
+        The "folder" providers are from local sources (packages do not contain provider.yaml files),
+        so if someone has airflow installed from local sources, the providers are imported from there
+        first so, provider information should be taken from there.
+        :return:
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict

Review comment:
       ```suggestion
           self._hooks_dict = OrderedDict(sorted(self._hooks_dict.items()))
   ```
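   
   A quick check (with illustrative values) that the suggested one-liner produces the same ordering as the explicit loop above:
   
   ```python
   from collections import OrderedDict
   
   hooks = {"postgres": ("PostgresHook", "postgres_conn_id"),
            "mysql": ("MySqlHook", "mysql_conn_id")}
   
   explicit = OrderedDict()
   for conn_type in sorted(hooks.keys()):
       explicit[conn_type] = hooks[conn_type]
   
   assert explicit == OrderedDict(sorted(hooks.items()))
   ```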




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532159860



##########
File path: airflow/providers/google/cloud/hooks/bigquery.py
##########
@@ -67,7 +67,9 @@
 class BigQueryHook(GoogleBaseHook, DbApiHook):
     """Interact with BigQuery. This hook uses the Google Cloud connection."""
 
-    conn_name_attr = 'gcp_conn_id'  # type: str
+    conn_name_attr = 'gcp_conn_id'
+    default_conn_name = 'google_cloud_default'
+    conn_type = 'google_cloud_platform'

Review comment:
       Possibly. I just do not want to fix all those in this PR. I simply converted what I found in the original connection.py. 
   
   I propose that we start agreeing on the approach there and standardize those, but that should be a completely different PR (and if you want to propose some guidelines and a standard/consistent approach for that - feel free).
   
   I do not want to scope-creep that change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r526779883



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -17,6 +17,11 @@
         "type": "string"
       }
     },
+    "provider-package": {

Review comment:
       >         are we allowing any package to contain provider?
   
   Yes, I firmly believe this should be the case. By not having third-party packages live under the `airflow.` namespace, it becomes much easier for us and, hopefully, for users to know what is an official provider (and thus supported by us) vs what is not.
   
   I think the best approach would be something like a setuptools entrypoint.
   
   That way _any_ package can register a provider, by adding something like this to its setup.cfg:
   
   ```ini
   [options.entry_points]
   apache-airflow-provider=
       x=my_company.provider.x:get_provider_info
   ```
   
   And then in `my_company/provider/x.py` we could have something like this:
   
   ```python
   def get_provider_info():
       return {
           "description": "X"
       }
   ```
   
   We don't need to include the package name, nor the version, as those are available from Entrypoint already.
   
   In short: Python already has this mechanism that is well used. Let's not re-invent something different.
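   
   For the consumer side, a hedged sketch of how the discovery loop could look, using the same entry point group name the PR's code scans ('apache_airflow_provider'); `pkg_resources` ships with setuptools:
   
   ```python
   import pkg_resources
   
   for entry_point in pkg_resources.iter_entry_points("apache_airflow_provider"):
       get_provider_info = entry_point.load()  # the callable the package registered
       provider_info = get_provider_info()     # dict to validate against the schema
       print(entry_point.dist.project_name, entry_point.dist.version)
   ```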




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r529497101



##########
File path: airflow/providers/apache/hive/hooks/hive.py
##########
@@ -78,9 +78,13 @@ class HiveCliHook(BaseHook):
     :type  mapred_job_name: str
     """
 
+    conn_name_attr = 'hive_cli_conn_id'

Review comment:
       I fully agree. But I did not want to break compatibility (potentially someone could have used those values - especially the defaults), and my goal was just to make it work for providers, without having to deal with back-compatibility :).
   
   There is a Polish saying I tend to use in such cases, which I learned from a friend (it has some curse words and rhymes, so I won't translate it verbatim here, but it roughly means "consistent is better than perfect").
   
   I fully agree constants *should* be UPPERCASE of course. 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532049346



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {

Review comment:
       Above we have a 'hooks' section that accepts the python-module key. Should we take advantage of this?
   https://github.com/apache/airflow/blob/fb23e61317c0173f2e4508b28fd966c776739659/airflow/provider.yaml.schema.json#L120-L143
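   
   For comparison with the existing 'hooks' section, a hedged sketch of how a provider would declare the new key, shown as the dict a `get_provider_info()` entrypoint returns (values are illustrative; only keys already used elsewhere in this PR are shown):
   
   ```python
   def get_provider_info():
       return {
           "package-name": "apache-airflow-providers-sqlite",
           "versions": ["1.0.0"],
           "hook-class-names": [
               "airflow.providers.sqlite.hooks.sqlite.SqliteHook",
           ],
       }
   ```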




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735373343


   Just one last fixup - we had a cyclic dependency: Connection -> providers_manager -> plugins_manager -> Hook -> Connection. Solved it by extracting `entry_points_with_dist` to a separate util module (as this is a util used by both plugins_manager and providers_manager now).
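   
   A minimal sketch of what such an extracted helper can look like (assuming `importlib_metadata` is available; this mirrors the behaviour described above, not necessarily the exact code): yield each entry point in a group together with the distribution that provides it, so callers get the package name and version without a second lookup.
   
   ```python
   import importlib_metadata
   
   
   def entry_points_with_dist(group: str):
       """Yield (entry_point, distribution) pairs for the given entry point group."""
       for dist in importlib_metadata.distributions():
           for entry_point in dist.entry_points:
               if entry_point.group == group:
                   yield entry_point, dist
   ```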


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532069124



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {
+      "type": "array",
+      "items": {
+        "type": "string",
+        "description": "Hooks class names that provide connection types to core"

Review comment:
       It seems to me that we should try to describe the most general fields. If we need it, we can also describe lower level elements.
   
   From a validation perspective, these fields are ignored, but describing the list of items provides better hints in the IDE.
   Before:
   <img width="609" alt="Screenshot 2020-11-28 at 18 11 53" src="https://user-images.githubusercontent.com/12058428/100521656-44e2a100-31a5-11eb-9b84-1cc55bc016e9.png">
   
   After:
   <img width="639" alt="Screenshot 2020-11-28 at 18 10 02" src="https://user-images.githubusercontent.com/12058428/100521645-3a280c00-31a5-11eb-8412-efd798a608e9.png">
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-730070377


   I will still have to make some doc generation fixes, but at least it should be good to review the approach.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531548191



##########
File path: airflow/cli/commands/provider_command.py
##########
@@ -0,0 +1,88 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Providers sub-commands"""
+from typing import Dict, List, Tuple
+
+import pygments
+import yaml
+from pygments.lexers.data import YamlLexer
+from tabulate import tabulate
+
+from airflow.providers_manager import ProvidersManager
+from airflow.utils.cli import should_use_colors
+from airflow.utils.code_utils import get_terminal_formatter
+
+
+def _tabulate_providers(providers: List[Dict], tablefmt: str):
+    tabulate_data = [
+        {
+            'Provider name': provider['package-name'],
+            'Description': provider['description'],
+            'Version': provider['versions'][0],
+        }
+        for version, provider in providers
+    ]
+
+    msg = tabulate(tabulate_data, tablefmt=tablefmt, headers='keys')
+    return msg
+
+
+def provider_get(args):
+    """Get a provider info."""
+    providers = ProvidersManager().providers
+    if args.provider_name in providers:
+        provider_version = providers[args.provider_name][0]
+        provider_info = providers[args.provider_name][1]

Review comment:
       Yep




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735371057


   The PR needs to run all tests because it modifies core of Airflow! Please rebase it to latest master or ask committer to re-run it!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-732354593


   This one should be good to go as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r526511344



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -17,6 +17,11 @@
         "type": "string"
       }
     },
+    "provider-package": {

Review comment:
       @ashb @mik-laj -> this is also in preparation for being able to use 3rd-party providers. I've added provider-package to the provider.yaml to avoid repetition and made all modules/classes relative. This way it is really safe - because we will not be able to accidentally point to a class outside of our own package.
   
   I also added a check that the provider package is properly named, but only when it starts with "airflow.providers".
   
   Currently we only support airflow.providers anyway.
   
   It might be good to discuss now and agree on future conventions for those 3rd-parties eventually:
   
   1) are we allowing any package to contain a provider?
   If not:
   2a) should all providers live in airflow.providers
   2b) should they live in airflow.contrib.providers (or similar)
   if 2a) or 2b) -> should we allow packages that do not follow the naming convention:
   
   apache-airflow-providers-<PROVIDER> (with '.' replaced by '-')?
   
   OR maybe some other approach? WDYT?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532191669



##########
File path: airflow/providers/google/cloud/hooks/bigquery.py
##########
@@ -67,7 +67,9 @@
 class BigQueryHook(GoogleBaseHook, DbApiHook):
     """Interact with BigQuery. This hook uses the Google Cloud connection."""
 
-    conn_name_attr = 'gcp_conn_id'  # type: str
+    conn_name_attr = 'gcp_conn_id'
+    default_conn_name = 'google_cloud_default'
+    conn_type = 'google_cloud_platform'

Review comment:
       @ashb Since BigQuery implements DbApiHook, we should probably create a new connection type to be used by DbApiHook.
   
   I am afraid of registering all hooks because that will also mean that they will be loaded. And loading all modules from all Google libraries into memory is an expensive operation.
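   
   A hedged sketch of the kind of deferral this suggests: store only dotted paths at discovery time and import the class on first use. This mirrors `airflow.utils.module_loading.import_string`, which `connection.py` already imports; the hook path below is illustrative:
   
   ```python
   import importlib
   
   
   def import_string(dotted_path: str):
       """Import a class by its dotted path, loading its module only now."""
       module_path, class_name = dotted_path.rsplit(".", maxsplit=1)
       return getattr(importlib.import_module(module_path), class_name)
   
   
   # Nothing from the provider package is imported until this call:
   hook_class = import_string("airflow.providers.sqlite.hooks.sqlite.SqliteHook")
   ```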




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735376106


   > Both providers_manager and plugins_manager are at the same "logical" level, so there is no reason why providers manager should import anything from plugins_manager. Why not plugins manager importing the same function from providers manager ?
   > 
   > So extracting it to common module makes perfect sense and is "cleaner" architecture - and this is what pylint complains about.
   
   Yes, in this case I agree it is cleaner.
   
   My general complaint with this pylint check is that I disagree it detects actual problems.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531552850



##########
File path: airflow/providers/microsoft/mssql/hooks/mssql.py
##########
@@ -36,7 +37,7 @@ def __init__(self, *args, **kwargs) -> None:
             (
                 "This class is deprecated and will be removed in Airflow 2.0.\n"
                 "pymssql is discontinued.  See https://github.com/pymssql/pymssql/issues/668.\n"
-                "Please use `airflow.providers.odbc.hooks.odbc.OdbcHook`"

Review comment:
       I did not even know it had been revived. It was not clear from the warning that was still there - but I see that you un-deprecated the whole module in https://github.com/apache/airflow/commit/765d29ecc9fd6a3220efa0a6c4ce10848f5cbf82 - but not the class itself, which is why I added the OdbcHook. In this case, yes - we should remove the deprecation, and this class is not needed any more.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735376829


   > My general complaint with this pylint check is that in general I disagree that it detects actual problems.
   
   I'll stick to my opinion that it actually detects bad "smells" in code architecture, and that solving them leads to nicer code that has fewer hidden, unnecessary dependencies and is easier to maintain, because the code follows the "single responsibility" principle much better and avoids the "god object" pattern: https://en.wikipedia.org/wiki/God_object. While using local imports might work, those dependencies (like in this case) are still there, and they make it more difficult to refactor and understand the code.
   
   But that's only my opinion, not a universal truth, of course - it comes from much better structured and more strongly typed languages.
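
   For illustration, the shape of the extraction discussed above (the module name is hypothetical): both managers import the shared helper from a neutral module instead of one importing from the other:

   ```python
   # airflow/utils/entry_points_helper.py (hypothetical location)
   import pkg_resources


   def entry_points_with_dist(group: str):
       """Yield (entry_point, distribution) pairs for an entry point group."""
       for entry_point in pkg_resources.iter_entry_points(group):
           yield entry_point, entry_point.dist

   # providers_manager.py and plugins_manager.py can then both do:
   # from airflow.utils.entry_points_helper import entry_points_with_dist
   ```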


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735375277


   > > Did something really detect that as a cycle? Cos importing `connection.py` would not cause the provider_manager to import anything from plugins_manager -- only running code would do that, not simply importing. Therefore that is not a cyclic import.
   > 
   > Yes. Pylint did and fail.
   
   https://github.com/PyCQA/pylint/issues/850 - anyway, not a "real" cyclic import (or at least not one triggered at import time), but moving it out is cleaner.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-731813095


   Hey @ashb - we can talk about it tomorrow. I understand (I think) your point about connection parameters being part of the Hook - but I would love to see if we can do it in a way that makes it nice to discover the connections from providers and easy for non-provider (casual) users to use the same kind of functionality.
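
   A sketch of that direction (the class and its values are illustrative; the attribute names match the BigQueryHook diff quoted earlier in this thread): the connection metadata lives on the Hook class itself, so discovery can read it from providers and a casual user's custom hook can expose it the same way:

   ```python
   class MyServiceHook:
       """Illustrative hook carrying its own connection metadata."""

       conn_name_attr = "my_service_conn_id"
       default_conn_name = "my_service_default"
       conn_type = "my_service"

       def __init__(self, my_service_conn_id: str = "my_service_default") -> None:
           self.my_service_conn_id = my_service_conn_id
   ```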


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531485590



##########
File path: airflow/cli/commands/provider_command.py
##########
@@ -0,0 +1,88 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Providers sub-commands"""
+from typing import Dict, List, Tuple
+
+import pygments
+import yaml
+from pygments.lexers.data import YamlLexer
+from tabulate import tabulate
+
+from airflow.providers_manager import ProvidersManager
+from airflow.utils.cli import should_use_colors
+from airflow.utils.code_utils import get_terminal_formatter
+
+
+def _tabulate_providers(providers: List[Dict], tablefmt: str):
+    tabulate_data = [
+        {
+            'Provider name': provider['package-name'],
+            'Description': provider['description'],
+            'Version': provider['versions'][0],
+        }
+        for version, provider in providers
+    ]
+
+    msg = tabulate(tabulate_data, tablefmt=tablefmt, headers='keys')
+    return msg
+
+
+def provider_get(args):
+    """Get a provider info."""
+    providers = ProvidersManager().providers
+    if args.provider_name in providers:
+        provider_version = providers[args.provider_name][0]
+        provider_info = providers[args.provider_name][1]

Review comment:
       ```suggestion
           provider_version, provider_info = providers[args.provider_name]
   ```

##########
File path: airflow/models/connection.py
##########
@@ -30,69 +31,20 @@
 from airflow.exceptions import AirflowException, AirflowNotFoundException
 from airflow.models.base import ID_LEN, Base
 from airflow.models.crypto import get_fernet
+from airflow.providers_manager import ProvidersManager
 from airflow.utils.log.logging_mixin import LoggingMixin
 from airflow.utils.module_loading import import_string
 
-# A map that assigns a connection type to a tuple that contains
-# the path of the class and the name of the conn_id key parameter.
-# PLEASE KEEP BELOW LIST IN ALPHABETICAL ORDER.
-CONN_TYPE_TO_HOOK = {
-    "azure_batch": (
-        "airflow.providers.microsoft.azure.hooks.azure_batch.AzureBatchHook",
-        "azure_batch_conn_id",
-    ),
-    "azure_cosmos": (
-        "airflow.providers.microsoft.azure.hooks.azure_cosmos.AzureCosmosDBHook",
-        "azure_cosmos_conn_id",
-    ),
-    "azure_data_lake": (
-        "airflow.providers.microsoft.azure.hooks.azure_data_lake.AzureDataLakeHook",
-        "azure_data_lake_conn_id",
-    ),
-    "cassandra": ("airflow.providers.apache.cassandra.hooks.cassandra.CassandraHook", "cassandra_conn_id"),
-    "cloudant": ("airflow.providers.cloudant.hooks.cloudant.CloudantHook", "cloudant_conn_id"),
-    "dataprep": ("airflow.providers.google.cloud.hooks.dataprep.GoogleDataprepHook", "dataprep_default"),
-    "docker": ("airflow.providers.docker.hooks.docker.DockerHook", "docker_conn_id"),
-    "elasticsearch": (
-        "airflow.providers.elasticsearch.hooks.elasticsearch.ElasticsearchHook",
-        "elasticsearch_conn_id",
-    ),
-    "exasol": ("airflow.providers.exasol.hooks.exasol.ExasolHook", "exasol_conn_id"),
-    "gcpcloudsql": (
-        "airflow.providers.google.cloud.hooks.cloud_sql.CloudSQLDatabaseHook",
-        "gcp_cloudsql_conn_id",
-    ),
-    "gcpssh": (
-        "airflow.providers.google.cloud.hooks.compute_ssh.ComputeEngineSSHHook",
-        "gcp_conn_id",
-    ),
-    "google_cloud_platform": (
-        "airflow.providers.google.cloud.hooks.bigquery.BigQueryHook",
-        "bigquery_conn_id",
-    ),
-    "grpc": ("airflow.providers.grpc.hooks.grpc.GrpcHook", "grpc_conn_id"),
-    "hive_cli": ("airflow.providers.apache.hive.hooks.hive.HiveCliHook", "hive_cli_conn_id"),
-    "hiveserver2": ("airflow.providers.apache.hive.hooks.hive.HiveServer2Hook", "hiveserver2_conn_id"),
-    "imap": ("airflow.providers.imap.hooks.imap.ImapHook", "imap_conn_id"),
-    "jdbc": ("airflow.providers.jdbc.hooks.jdbc.JdbcHook", "jdbc_conn_id"),
-    "jira": ("airflow.providers.jira.hooks.jira.JiraHook", "jira_conn_id"),
-    "kubernetes": ("airflow.providers.cncf.kubernetes.hooks.kubernetes.KubernetesHook", "kubernetes_conn_id"),
-    "mongo": ("airflow.providers.mongo.hooks.mongo.MongoHook", "conn_id"),
-    "mssql": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "mysql": ("airflow.providers.mysql.hooks.mysql.MySqlHook", "mysql_conn_id"),
-    "odbc": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "oracle": ("airflow.providers.oracle.hooks.oracle.OracleHook", "oracle_conn_id"),
-    "pig_cli": ("airflow.providers.apache.pig.hooks.pig.PigCliHook", "pig_cli_conn_id"),
-    "postgres": ("airflow.providers.postgres.hooks.postgres.PostgresHook", "postgres_conn_id"),
-    "presto": ("airflow.providers.presto.hooks.presto.PrestoHook", "presto_conn_id"),
-    "redis": ("airflow.providers.redis.hooks.redis.RedisHook", "redis_conn_id"),
-    "snowflake": ("airflow.providers.snowflake.hooks.snowflake.SnowflakeHook", "snowflake_conn_id"),
-    "sqlite": ("airflow.providers.sqlite.hooks.sqlite.SqliteHook", "sqlite_conn_id"),
-    "tableau": ("airflow.providers.salesforce.hooks.tableau.TableauHook", "tableau_conn_id"),
-    "vertica": ("airflow.providers.vertica.hooks.vertica.VerticaHook", "vertica_conn_id"),
-    "wasb": ("airflow.providers.microsoft.azure.hooks.wasb.WasbHook", "wasb_conn_id"),
-}
-# PLEASE KEEP ABOVE LIST IN ALPHABETICAL ORDER.
+
+@lru_cache(maxsize=None)
+def get_conn_type_to_hook() -> Dict[str, Tuple[str, str]]:

Review comment:
       ```suggestion
   def get_conn_type_to_hook_mapping() -> Dict[str, Tuple[str, str]]:
   ```
   
   As an English speaker, `get_conn_type_to_hook` reads to me like it should be used as `get_conn_type_to_hook(self.conn_type)`, so a slight tweak to the function name, please.
   
   That said, given this is a one line function that just accesses a property on a singleton object, do we even need the function?
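
   The alternative would be direct access at call sites - a sketch, assuming a public `hooks` property mirroring the private `_hooks_dict`:

   ```python
   from airflow.providers_manager import ProvidersManager

   # Look up the hook class path and the connection-id attribute name for a
   # connection type directly on the singleton, with no wrapper function.
   hook_class_name, conn_id_attr = ProvidersManager().hooks["mysql"]
   ```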

##########
File path: dev/provider_packages/SETUP_TEMPLATE.py.jinja2
##########
@@ -59,19 +59,6 @@ def do_setup(version_suffix_for_pypi=''):
         packages=find_namespace_packages(
             include=['airflow.providers.{{ PROVIDER_PACKAGE_ID }}',
                      'airflow.providers.{{ PROVIDER_PACKAGE_ID }}.*']),
-        package_data={ '': [
-{% if PROVIDER_PACKAGE_ID == 'amazon' %}
-          "*.json",
-{% elif PROVIDER_PACKAGE_ID == 'cncf.kubernetes' %}
-          "*.yaml",
-{% elif PROVIDER_PACKAGE_ID == 'google' %}
-          "*.yaml",
-          "*.sql",
-{% elif PROVIDER_PACKAGE_ID == 'papermill' %}
-          "*.ipynb",
-{% endif %}
-            ],
-        },

Review comment:
       I think this is needed - otherwise the data files are not included in built wheels.
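
   For context, a sketch of the kind of declaration being removed (setup.py style; the glob patterns are the ones from the template above, for the google provider):

   ```python
   from setuptools import find_namespace_packages, setup

   setup(
       name="apache-airflow-providers-google",
       packages=find_namespace_packages(
           include=["airflow.providers.google", "airflow.providers.google.*"]
       ),
       # Without this, the *.yaml / *.sql data files are left out of built wheels.
       package_data={"": ["*.yaml", "*.sql"]},
   )
   ```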

##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,231 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates the hooks dictionary as a sorted (by connection type) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        The dictionary gets sorted so that when you iterate through it, the hooks are
+        by default returned in alphabetical order of connection type.
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict
+
+    def _sort_provider_dictionary(self):
+        """
+        Sort provider_dictionary using OrderedDict.
+
+        The dictionary gets sorted so that when you iterate through it, the providers are by
+        default returned in alphabetical order.
+        """
+        sorted_dict = OrderedDict()
+        for provider_name in sorted(self._provider_dict.keys()):
+            sorted_dict[provider_name] = self._provider_dict[provider_name]
+        self._provider_dict = sorted_dict
+
+    def _discover_all_providers_from_packages(self) -> None:
+        """
+        Discovers all providers by scanning packages installed. The list of providers should be returned
+        via the 'apache_airflow_provider' entrypoint as a dictionary conforming to the
+        'airflow/provider.yaml.schema.json' schema.
+        """
+        for entry_point in pkg_resources.iter_entry_points('apache_airflow_provider'):
+            package_name = entry_point.dist.project_name
+            log.debug("Loading %s from package %s", entry_point, package_name)
+            version = entry_point.dist.version
             try:
-                imported_module = importlib.import_module(module_info.name)
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when importing %s:%s", module_info.name, e)
+                provider_info = entry_point.load()()
+            except pkg_resources.VersionConflict as e:
+                log.warning(
+                    "The provider package %s could not be registered because of version conflict : %s",
+                    package_name,
+                    e,
+                )
                 continue
-            try:
-                provider = importlib_resources.read_text(imported_module, 'provider.yaml')
-                provider_info = yaml.safe_load(provider)
-                self._validator.validate(provider_info)
-                self._provider_directory[provider_info['package-name']] = provider_info
-            except FileNotFoundError:
-                # This is OK - this is not a provider package
-                pass
-            except TypeError as e:
-                if "is not a package" not in str(e):
-                    log.warning("Error when loading 'provider.yaml' file from %s:%s}", module_info.name, e)
-                # Otherwise this is OK - this is likely a module
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when loading 'provider.yaml' file from %s:%s", module_info.name, e)
+            self._validator.validate(provider_info)
+            provider_info_package_name = provider_info['package-name']
+            if package_name != provider_info_package_name:
+                raise Exception(
+                    f"The package '{package_name}' from setuptools and "
+                    f"{provider_info_package_name} do not match. Please make sure they are aligned"
+                )
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The provider for package '%s' could not be registered from because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+
+    def _discover_all_airflow_builtin_providers_from_local_sources(self) -> None:
+        """
+        Finds all built-in airflow providers if airflow is run from the local sources.
+        It finds `provider.yaml` files for all such providers and registers the providers using those.
+
+        This 'provider.yaml' scanning takes precedence over scanning packages installed
+        in case you have both sources and packages installed, the providers will be loaded from
+        the "airflow" sources rather than from the packages.
+        """
+        try:
+            import airflow.providers
+        except ImportError:
+            log.info("You have no providers installed.")
+            return
+        try:
+            for path in airflow.providers.__path__:
+                self._add_provider_info_from_local_source_files_on_path(path)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading 'provider.yaml' files from airflow sources: %s", e)
+
+    def _add_provider_info_from_local_source_files_on_path(self, path) -> None:
+        """
+        Finds all the provider.yaml files in the directory specified.
+
+        :param path: path where to look for provider.yaml files
+        """
+        root_path = path
+        for folder, subdirs, files in os.walk(path, topdown=True):
+            for filename in fnmatch.filter(files, "provider.yaml"):
+                package_name = "apache-airflow-providers" + folder[len(root_path) :].replace(os.sep, "-")
+                self._add_provider_info_from_local_source_file(os.path.join(folder, filename), package_name)
+                subdirs[:] = []
+
+    def _add_provider_info_from_local_source_file(self, path, package_name) -> None:
+        """
+        Parses found provider.yaml file and adds found provider to the dictionary.
+
+        :param path: full file path of the provider.yaml file
+        :param package_name: name of the package
+        """
+        try:
+            log.debug("Loading %s from %s", package_name, path)
+            with open(path) as provider_yaml_file:
+                provider_info = yaml.safe_load(provider_yaml_file)
+            self._validator.validate(provider_info)
+
+            version = provider_info['versions'][0]
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The providers for package '%s' could not be registered because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading '%s': %s", path, e)
+
+    def _discover_hooks(self) -> None:
+        """Retrieves all connections defined in the providers"""
+        for name, provider in self._provider_dict.items():
+            provider_package = name
+            hook_class_names = provider[1].get("hook-class-names")
+            if hook_class_names:
+                for hook_class_name in hook_class_names:
+                    self._add_hook(hook_class_name, provider_package)
+
+    def _add_hook(self, hook_class_name, provider_package) -> None:
+        """
+        Adds hook class name to list of hooks
+
+        :param hook_class_name: name of the Hook class
+        :param provider_package: provider package adding the hook
+        """
+        if provider_package.startswith("apache-airflow"):
+            provider_path = provider_package[len("apache-") :].replace("-", ".")
+            if not hook_class_name.startswith(provider_path):
+                log.warning(
+                    "Sanity check failed when importing '%s' from '%s' package. It should start with '%s'",
+                    hook_class_name,
+                    provider_package,
+                    provider_path,
+                )
+                return
+        if hook_class_name in self._hooks_dict:
+            log.warning(
+                "The hook_class '%s' has been already registered.",
+                hook_class_name,
+            )
+            return
+        try:
+            module, class_name = hook_class_name.rsplit('.', maxsplit=1)
+            hook_class = getattr(importlib.import_module(module), class_name)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning(
+                "Exception when importing '%s' from '%s' package: %s",
+                hook_class_name,
+                provider_package,
+                e,
+            )
+            return
+        conn_type = getattr(hook_class, 'conn_type')
+        if not conn_type:
+            log.warning(
+                "The hook_class '%s' misses connection_type attribute and cannot be registered",
+                hook_class,
+            )
+            return
+        connection_if_attribute_name = getattr(hook_class, 'conn_name_attr')
+        if not connection_if_attribute_name:
+            log.warning(
+                "The hook_class '%s' misses conn_name_attr attribute and cannot be registered",
+                hook_class,
+            )
+            return
+
+        self._hooks_dict[conn_type] = (hook_class_name, connection_if_attribute_name)

Review comment:
       ```suggestion
           connection_id_attribute_name = getattr(hook_class, 'conn_name_attr')
           if not connection_id_attribute_name:
               log.warning(
                   "The hook_class '%s' is missing `conn_name_attr` attribute and cannot be registered",
                   hook_class,
               )
               return
   
           self._hooks_dict[conn_type] = (hook_class_name, connection_id_attribute_name)
   ```
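
   One note on the ProvidersManager quoted above: the `__new__` override makes every instantiation return the same object. A quick sketch of the behaviour (note that `__init__` still runs on each call unless it is guarded):

   ```python
   from airflow.providers_manager import ProvidersManager

   # Both names refer to the same process-wide instance.
   manager_a = ProvidersManager()
   manager_b = ProvidersManager()
   assert manager_a is manager_b
   ```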

##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,231 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates the hooks dictionary as a sorted (by connection type) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        The dictionary gets sorted so that when you iterate through it, the hooks are
+        by default returned in alphabetical order of connection type.
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict
+
+    def _sort_provider_dictionary(self):
+        """
+        Sort provider_dictionary using OrderedDict.
+
+        The dictionary gets sorted so that when you iterate through it, the providers are by
+        default returned in alphabetical order.
+        """
+        sorted_dict = OrderedDict()
+        for provider_name in sorted(self._provider_dict.keys()):
+            sorted_dict[provider_name] = self._provider_dict[provider_name]
+        self._provider_dict = sorted_dict
+
+    def _discover_all_providers_from_packages(self) -> None:
+        """
+        Discovers all providers by scanning packages installed. The list of providers should be returned
+        via the 'apache_airflow_provider' entrypoint as a dictionary conforming to the
+        'airflow/provider.yaml.schema.json' schema.
+        """
+        for entry_point in pkg_resources.iter_entry_points('apache_airflow_provider'):
+            package_name = entry_point.dist.project_name
+            log.debug("Loading %s from package %s", entry_point, package_name)
+            version = entry_point.dist.version
             try:
-                imported_module = importlib.import_module(module_info.name)
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when importing %s:%s", module_info.name, e)
+                provider_info = entry_point.load()()
+            except pkg_resources.VersionConflict as e:
+                log.warning(
+                    "The provider package %s could not be registered because of version conflict : %s",
+                    package_name,
+                    e,
+                )
                 continue
-            try:
-                provider = importlib_resources.read_text(imported_module, 'provider.yaml')
-                provider_info = yaml.safe_load(provider)
-                self._validator.validate(provider_info)
-                self._provider_directory[provider_info['package-name']] = provider_info
-            except FileNotFoundError:
-                # This is OK - this is not a provider package
-                pass
-            except TypeError as e:
-                if "is not a package" not in str(e):
-                    log.warning("Error when loading 'provider.yaml' file from %s:%s}", module_info.name, e)
-                # Otherwise this is OK - this is likely a module
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when loading 'provider.yaml' file from %s:%s", module_info.name, e)
+            self._validator.validate(provider_info)
+            provider_info_package_name = provider_info['package-name']
+            if package_name != provider_info_package_name:
+                raise Exception(
+                    f"The package '{package_name}' from setuptools and "
+                    f"{provider_info_package_name} do not match. Please make sure they are aligned"
+                )
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The provider for package '%s' could not be registered from because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+
+    def _discover_all_airflow_builtin_providers_from_local_sources(self) -> None:
+        """
+        Finds all built-in airflow providers if airflow is run from the local sources.
+        It finds `provider.yaml` files for all such providers and registers the providers using those.
+
+        This 'provider.yaml' scanning takes precedence over scanning packages installed
+        in case you have both sources and packages installed, the providers will be loaded from
+        the "airflow" sources rather than from the packages.
+        """
+        try:
+            import airflow.providers
+        except ImportError:
+            log.info("You have no providers installed.")
+            return
+        try:
+            for path in airflow.providers.__path__:
+                self._add_provider_info_from_local_source_files_on_path(path)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading 'provider.yaml' files from airflow sources: %s", e)
+
+    def _add_provider_info_from_local_source_files_on_path(self, path) -> None:
+        """
+        Finds all the provider.yaml files in the directory specified.
+
+        :param path: path where to look for provider.yaml files
+        """
+        root_path = path
+        for folder, subdirs, files in os.walk(path, topdown=True):
+            for filename in fnmatch.filter(files, "provider.yaml"):
+                package_name = "apache-airflow-providers" + folder[len(root_path) :].replace(os.sep, "-")
+                self._add_provider_info_from_local_source_file(os.path.join(folder, filename), package_name)
+                subdirs[:] = []
+
+    def _add_provider_info_from_local_source_file(self, path, package_name) -> None:
+        """
+        Parses found provider.yaml file and adds found provider to the dictionary.
+
+        :param path: full file path of the provider.yaml file
+        :param package_name: name of the package
+        """
+        try:
+            log.debug("Loading %s from %s", package_name, path)
+            with open(path) as provider_yaml_file:
+                provider_info = yaml.safe_load(provider_yaml_file)
+            self._validator.validate(provider_info)
+
+            version = provider_info['versions'][0]
+            if package_name not in self._provider_dict:
+                self._provider_dict[package_name] = (version, provider_info)
+            else:
+                log.warning(
+                    "The providers for package '%s' could not be registered because providers for that "
+                    "package name have already been registered",
+                    package_name,
+                )
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning("Error when loading '%s': %s", path, e)
+
+    def _discover_hooks(self) -> None:
+        """Retrieves all connections defined in the providers"""
+        for name, provider in self._provider_dict.items():
+            provider_package = name
+            hook_class_names = provider[1].get("hook-class-names")
+            if hook_class_names:
+                for hook_class_name in hook_class_names:
+                    self._add_hook(hook_class_name, provider_package)
+
+    def _add_hook(self, hook_class_name, provider_package) -> None:
+        """
+        Adds hook class name to list of hooks
+
+        :param hook_class_name: name of the Hook class
+        :param provider_package: provider package adding the hook
+        """
+        if provider_package.startswith("apache-airflow"):
+            provider_path = provider_package[len("apache-") :].replace("-", ".")
+            if not hook_class_name.startswith(provider_path):
+                log.warning(
+                    "Sanity check failed when importing '%s' from '%s' package. It should start with '%s'",
+                    hook_class_name,
+                    provider_package,
+                    provider_path,
+                )
+                return
+        if hook_class_name in self._hooks_dict:
+            log.warning(
+                "The hook_class '%s' has been already registered.",
+                hook_class_name,
+            )
+            return
+        try:
+            module, class_name = hook_class_name.rsplit('.', maxsplit=1)
+            hook_class = getattr(importlib.import_module(module), class_name)
+        except Exception as e:  # noqa pylint: disable=broad-except
+            log.warning(
+                "Exception when importing '%s' from '%s' package: %s",
+                hook_class_name,
+                provider_package,
+                e,
+            )
+            return
+        conn_type = getattr(hook_class, 'conn_type')
+        if not conn_type:
+            log.warning(
+                "The hook_class '%s' misses connection_type attribute and cannot be registered",

Review comment:
       ```suggestion
                   "The hook_class '%s' is missing `connection_type` attribute and cannot be registered",
   ```
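
   For readers following the entry point mechanism quoted above, a minimal sketch of how a third-party distribution might register itself (all names are hypothetical; only the 'apache_airflow_provider' group comes from the code above). Note that the distribution name must match the `package-name` returned by the callable, or registration raises:

   ```python
   # my_company_provider/__init__.py
   def get_provider_info():
       """Callable exposed via the 'apache_airflow_provider' entry point group."""
       return {
           "package-name": "my-company-airflow-provider",
           "description": "Hooks and operators for My Company services",
           "versions": ["1.0.0"],
           "hook-class-names": [
               "my_company_provider.hooks.my_service.MyServiceHook",
           ],
       }

   # setup.py (relevant fragment)
   # setup(
   #     name="my-company-airflow-provider",
   #     entry_points={
   #         "apache_airflow_provider": [
   #             "provider_info=my_company_provider:get_provider_info",
   #         ],
   #     },
   # )
   ```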




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj edited a comment on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj edited a comment on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735349055


   @potiuk I agree with Jarek. Let's focus on making the necessary changes to the core. We can make additional improvements in providers regardless of the core changes. We can now easily update them independently from the core, so it will be a very nice task as a good-first-issue. I imagine each provider will have its operators, connection configurations, and other components described, if it has them.
   A good example, but not a perfect one, is the [`apache-airflow-providers-google`](http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow-providers-google/latest/index.html) package.
   
   If we now have all the information defined in the provider.yaml file, it will be very easy to create a ticket that describes the rules of what to do, and each person will be able to incrementally fill in missing documentation and other missing pieces.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r527275948



##########
File path: airflow/providers/apache/cassandra/provider.yaml
##########
@@ -33,10 +35,15 @@ integrations:
 sensors:
   - integration-name: Apache Cassandra
     python-modules:
-      - airflow.providers.apache.cassandra.sensors.record
-      - airflow.providers.apache.cassandra.sensors.table
+      - .sensors.record
+      - .sensors.table
 
 hooks:
   - integration-name: Apache Cassandra
     python-modules:
-      - airflow.providers.apache.cassandra.hooks.cassandra
+      - .hooks.cassandra
+
+connections:
+  - connection-type: cassandra
+    hook-class: .hooks.cassandra.CassandraHook
+    connection-id-parameter-name: cassandra_conn_id

Review comment:
       There are many other things that are not yet defined here (other PRs are addressing them) - extra links and connection forms in javascript. If you look at those, you will understand that they cannot be as easily incorporated in the classes.
   
   I prefer to keep "provider delivered functionality" in one place, rather than some of it in .yaml, some of it in some connection, and some of it in other classes.
   
   Also, our proposal is far from complete - it does not actually solve the problem of discovering those Hooks (unless you have not posted it yet). Even with the entry_points mechanism, you will have to somehow at least keep the list of hooks in each provider (there are at least 4 in the google provider). That would mean that you would have to not only keep the list of hooks but also remember to update the "extra" information in the hooks that you want to expose - this will have to be (by its nature) split between two places.
   
   What was your idea here? Where did you want to keep the list of connections to load?
    




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-731951567


   [The Workflow run](https://github.com/apache/airflow/actions/runs/378334877) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532061630



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {

Review comment:
       However, if we have a situation where some classes require a certain set of attributes and others do not, it may be a good opportunity for refactoring. I guess this one class has two purposes now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532061263



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {

Review comment:
       Exactly. I will add complete documentation on "Writing your own provider" after we merge it all, so that I do not lose time describing things that might change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-731991952


   [The Workflow run](https://github.com/apache/airflow/actions/runs/378468811) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532077592



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -125,6 +127,7 @@
             "description": "Integration name. It must have a matching item in the 'integration' section of any provider."
           },
           "python-modules": {
+            "description": "Name of the module.",

Review comment:
       Here we have a similar problem as we had below. Since it's an array, we should use the plural.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532061625



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {
+      "type": "array",
+      "items": {
+        "type": "string",
+        "description": "Hooks class names that provide connection types to core"

Review comment:
       What's the difference between the two in terms of UI/validation? Should we describe the item or the array, and what happens if we do both?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
kaxil commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531151327



##########
File path: airflow/models/connection.py
##########
@@ -30,69 +31,22 @@
 from airflow.exceptions import AirflowException, AirflowNotFoundException
 from airflow.models.base import ID_LEN, Base
 from airflow.models.crypto import get_fernet
+from airflow.providers_manager import ProvidersManager
 from airflow.utils.log.logging_mixin import LoggingMixin
 from airflow.utils.module_loading import import_string
 
-# A map that assigns a connection type to a tuple that contains
-# the path of the class and the name of the conn_id key parameter.
-# PLEASE KEEP BELOW LIST IN ALPHABETICAL ORDER.
-CONN_TYPE_TO_HOOK = {
-    "azure_batch": (
-        "airflow.providers.microsoft.azure.hooks.azure_batch.AzureBatchHook",
-        "azure_batch_conn_id",
-    ),
-    "azure_cosmos": (
-        "airflow.providers.microsoft.azure.hooks.azure_cosmos.AzureCosmosDBHook",
-        "azure_cosmos_conn_id",
-    ),
-    "azure_data_lake": (
-        "airflow.providers.microsoft.azure.hooks.azure_data_lake.AzureDataLakeHook",
-        "azure_data_lake_conn_id",
-    ),
-    "cassandra": ("airflow.providers.apache.cassandra.hooks.cassandra.CassandraHook", "cassandra_conn_id"),
-    "cloudant": ("airflow.providers.cloudant.hooks.cloudant.CloudantHook", "cloudant_conn_id"),
-    "dataprep": ("airflow.providers.google.cloud.hooks.dataprep.GoogleDataprepHook", "dataprep_default"),
-    "docker": ("airflow.providers.docker.hooks.docker.DockerHook", "docker_conn_id"),
-    "elasticsearch": (
-        "airflow.providers.elasticsearch.hooks.elasticsearch.ElasticsearchHook",
-        "elasticsearch_conn_id",
-    ),
-    "exasol": ("airflow.providers.exasol.hooks.exasol.ExasolHook", "exasol_conn_id"),
-    "gcpcloudsql": (
-        "airflow.providers.google.cloud.hooks.cloud_sql.CloudSQLDatabaseHook",
-        "gcp_cloudsql_conn_id",
-    ),
-    "gcpssh": (
-        "airflow.providers.google.cloud.hooks.compute_ssh.ComputeEngineSSHHook",
-        "gcp_conn_id",
-    ),
-    "google_cloud_platform": (
-        "airflow.providers.google.cloud.hooks.bigquery.BigQueryHook",
-        "bigquery_conn_id",
-    ),
-    "grpc": ("airflow.providers.grpc.hooks.grpc.GrpcHook", "grpc_conn_id"),
-    "hive_cli": ("airflow.providers.apache.hive.hooks.hive.HiveCliHook", "hive_cli_conn_id"),
-    "hiveserver2": ("airflow.providers.apache.hive.hooks.hive.HiveServer2Hook", "hiveserver2_conn_id"),
-    "imap": ("airflow.providers.imap.hooks.imap.ImapHook", "imap_conn_id"),
-    "jdbc": ("airflow.providers.jdbc.hooks.jdbc.JdbcHook", "jdbc_conn_id"),
-    "jira": ("airflow.providers.jira.hooks.jira.JiraHook", "jira_conn_id"),
-    "kubernetes": ("airflow.providers.cncf.kubernetes.hooks.kubernetes.KubernetesHook", "kubernetes_conn_id"),
-    "mongo": ("airflow.providers.mongo.hooks.mongo.MongoHook", "conn_id"),
-    "mssql": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "mysql": ("airflow.providers.mysql.hooks.mysql.MySqlHook", "mysql_conn_id"),
-    "odbc": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "oracle": ("airflow.providers.oracle.hooks.oracle.OracleHook", "oracle_conn_id"),
-    "pig_cli": ("airflow.providers.apache.pig.hooks.pig.PigCliHook", "pig_cli_conn_id"),
-    "postgres": ("airflow.providers.postgres.hooks.postgres.PostgresHook", "postgres_conn_id"),
-    "presto": ("airflow.providers.presto.hooks.presto.PrestoHook", "presto_conn_id"),
-    "redis": ("airflow.providers.redis.hooks.redis.RedisHook", "redis_conn_id"),
-    "snowflake": ("airflow.providers.snowflake.hooks.snowflake.SnowflakeHook", "snowflake_conn_id"),
-    "sqlite": ("airflow.providers.sqlite.hooks.sqlite.SqliteHook", "sqlite_conn_id"),
-    "tableau": ("airflow.providers.salesforce.hooks.tableau.TableauHook", "tableau_conn_id"),
-    "vertica": ("airflow.providers.vertica.hooks.vertica.VerticaHook", "vertica_conn_id"),
-    "wasb": ("airflow.providers.microsoft.azure.hooks.wasb.WasbHook", "wasb_conn_id"),
-}
-# PLEASE KEEP ABOVE LIST IN ALPHABETICAL ORDER.
+cache = lru_cache(maxsize=None)
+
+
+@cache

Review comment:
       ```suggestion
   
   @lru_cache(maxsize=None)
   ```
   
   Since it is used in just a single place.
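
   For reference, the effect of the decorator either way (a standalone sketch, not the PR code): the decorated function body runs once, and subsequent calls return the cached result:

   ```python
   from functools import lru_cache


   @lru_cache(maxsize=None)
   def expensive_discovery():
       print("scanning packages...")  # executed only on the first call
       return {"mysql": ("airflow.providers.mysql.hooks.mysql.MySqlHook", "mysql_conn_id")}


   expensive_discovery()  # prints and computes
   expensive_discovery()  # served from the cache, no print
   ```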

##########
File path: airflow/cli/commands/provider_command.py
##########
@@ -0,0 +1,88 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Providers sub-commands"""
+from typing import Dict, List, Tuple
+
+import pygments
+import yaml
+from pygments.lexers.data import YamlLexer
+from tabulate import tabulate
+
+from airflow.providers_manager import ProvidersManager
+from airflow.utils.cli import should_use_colors
+from airflow.utils.code_utils import get_terminal_formatter
+
+
+def _tabulate_providers(providers: List[Dict], tablefmt: str):
+    tabulate_data = [
+        {
+            'Provider name': provider['package-name'],
+            'Description': provider['description'],
+            'Version': provider['versions'][0],
+        }
+        for version, provider in providers
+    ]
+
+    msg = tabulate(tabulate_data, tablefmt=tablefmt, headers='keys')
+    return msg
+
+
+def provider_get(args):
+    """Get a provider info."""
+    providers = ProvidersManager().providers
+    if args.provider_name in providers:
+        provider_version = providers[args.provider_name][0]
+        provider_info = providers[args.provider_name][1]
+        print("#")
+        print(f"# Provider: {args.provider_name}")
+        print(f"# Version: {provider_version}")
+        print("#")
+        if args.full:
+            yaml_content = yaml.dump(provider_info)
+            if should_use_colors(args):
+                yaml_content = pygments.highlight(
+                    code=yaml_content, formatter=get_terminal_formatter(), lexer=YamlLexer()
+                )
+            print(yaml_content)
+    else:
+        raise SystemExit(f"No such provider installed: {args.provider_name}")
+
+
+def providers_list(args):
+    """Lists all providers at the command line"""
+    msg = _tabulate_providers(ProvidersManager().providers.values(), args.output)
+    print(msg)
+
+
+def _tabulate_hooks(hook_items: Tuple[str, Tuple[str, str]], tablefmt: str):
+    tabulate_data = [
+        {
+            'Connection type': hook_item[0],
+            'Hook class': hook_item[1][0],
+            'Hook connection attribute nme': hook_item[1][1],

Review comment:
       ```suggestion
               'Hook connection attribute name': hook_item[1][1],
   ```

##########
File path: airflow/models/connection.py
##########
@@ -30,69 +31,22 @@
 from airflow.exceptions import AirflowException, AirflowNotFoundException
 from airflow.models.base import ID_LEN, Base
 from airflow.models.crypto import get_fernet
+from airflow.providers_manager import ProvidersManager
 from airflow.utils.log.logging_mixin import LoggingMixin
 from airflow.utils.module_loading import import_string
 
-# A map that assigns a connection type to a tuple that contains
-# the path of the class and the name of the conn_id key parameter.
-# PLEASE KEEP BELOW LIST IN ALPHABETICAL ORDER.
-CONN_TYPE_TO_HOOK = {
-    "azure_batch": (
-        "airflow.providers.microsoft.azure.hooks.azure_batch.AzureBatchHook",
-        "azure_batch_conn_id",
-    ),
-    "azure_cosmos": (
-        "airflow.providers.microsoft.azure.hooks.azure_cosmos.AzureCosmosDBHook",
-        "azure_cosmos_conn_id",
-    ),
-    "azure_data_lake": (
-        "airflow.providers.microsoft.azure.hooks.azure_data_lake.AzureDataLakeHook",
-        "azure_data_lake_conn_id",
-    ),
-    "cassandra": ("airflow.providers.apache.cassandra.hooks.cassandra.CassandraHook", "cassandra_conn_id"),
-    "cloudant": ("airflow.providers.cloudant.hooks.cloudant.CloudantHook", "cloudant_conn_id"),
-    "dataprep": ("airflow.providers.google.cloud.hooks.dataprep.GoogleDataprepHook", "dataprep_default"),
-    "docker": ("airflow.providers.docker.hooks.docker.DockerHook", "docker_conn_id"),
-    "elasticsearch": (
-        "airflow.providers.elasticsearch.hooks.elasticsearch.ElasticsearchHook",
-        "elasticsearch_conn_id",
-    ),
-    "exasol": ("airflow.providers.exasol.hooks.exasol.ExasolHook", "exasol_conn_id"),
-    "gcpcloudsql": (
-        "airflow.providers.google.cloud.hooks.cloud_sql.CloudSQLDatabaseHook",
-        "gcp_cloudsql_conn_id",
-    ),
-    "gcpssh": (
-        "airflow.providers.google.cloud.hooks.compute_ssh.ComputeEngineSSHHook",
-        "gcp_conn_id",
-    ),
-    "google_cloud_platform": (
-        "airflow.providers.google.cloud.hooks.bigquery.BigQueryHook",
-        "bigquery_conn_id",
-    ),
-    "grpc": ("airflow.providers.grpc.hooks.grpc.GrpcHook", "grpc_conn_id"),
-    "hive_cli": ("airflow.providers.apache.hive.hooks.hive.HiveCliHook", "hive_cli_conn_id"),
-    "hiveserver2": ("airflow.providers.apache.hive.hooks.hive.HiveServer2Hook", "hiveserver2_conn_id"),
-    "imap": ("airflow.providers.imap.hooks.imap.ImapHook", "imap_conn_id"),
-    "jdbc": ("airflow.providers.jdbc.hooks.jdbc.JdbcHook", "jdbc_conn_id"),
-    "jira": ("airflow.providers.jira.hooks.jira.JiraHook", "jira_conn_id"),
-    "kubernetes": ("airflow.providers.cncf.kubernetes.hooks.kubernetes.KubernetesHook", "kubernetes_conn_id"),
-    "mongo": ("airflow.providers.mongo.hooks.mongo.MongoHook", "conn_id"),
-    "mssql": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "mysql": ("airflow.providers.mysql.hooks.mysql.MySqlHook", "mysql_conn_id"),
-    "odbc": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "oracle": ("airflow.providers.oracle.hooks.oracle.OracleHook", "oracle_conn_id"),
-    "pig_cli": ("airflow.providers.apache.pig.hooks.pig.PigCliHook", "pig_cli_conn_id"),
-    "postgres": ("airflow.providers.postgres.hooks.postgres.PostgresHook", "postgres_conn_id"),
-    "presto": ("airflow.providers.presto.hooks.presto.PrestoHook", "presto_conn_id"),
-    "redis": ("airflow.providers.redis.hooks.redis.RedisHook", "redis_conn_id"),
-    "snowflake": ("airflow.providers.snowflake.hooks.snowflake.SnowflakeHook", "snowflake_conn_id"),
-    "sqlite": ("airflow.providers.sqlite.hooks.sqlite.SqliteHook", "sqlite_conn_id"),
-    "tableau": ("airflow.providers.salesforce.hooks.tableau.TableauHook", "tableau_conn_id"),
-    "vertica": ("airflow.providers.vertica.hooks.vertica.VerticaHook", "vertica_conn_id"),
-    "wasb": ("airflow.providers.microsoft.azure.hooks.wasb.WasbHook", "wasb_conn_id"),
-}
-# PLEASE KEEP ABOVE LIST IN ALPHABETICAL ORDER.
+cache = lru_cache(maxsize=None)
+
+
+@cache
+def get_conn_type_to_hook() -> Dict[str, Tuple[str, str]]:
+    """
+    Return A map that assigns a connection type to a tuple that contains

Review comment:
       ```suggestion
       Returns a map that assigns a connection type to a tuple that contains
   ```

##########
File path: airflow/providers_manager.py
##########
@@ -36,52 +39,232 @@
 
 
 def _create_validator():
+    """Creates JSON schema validator from the provider.yaml.schema.json"""
     schema = json.loads(importlib_resources.read_text('airflow', 'provider.yaml.schema.json'))
     cls = jsonschema.validators.validator_for(schema)
     validator = cls(schema)
     return validator
 
 
 class ProvidersManager:
-    """Manages all provider packages."""
+    """
+    Manages all provider packages. This is a Singleton class. The first time it is
+    instantiated, it discovers all available providers in installed packages and
+    local source folders (if airflow is run from sources).
+    """
+
+    _instance = None
+    resource_version = "0"
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
 
     def __init__(self):
-        self._provider_directory = {}
-        try:
-            from airflow import providers
-        except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
-            return
+        # Keeps dict of providers keyed by module name and value is Tuple: version, provider_info
+        self._provider_dict: Dict[str, Tuple[str, Dict]] = {}
+        # Keeps dict of hooks keyed by connection type and value is
+        # Tuple: connection class, connection_id_attribute_name
+        self._hooks_dict: Dict[str, Tuple[str, str]] = {}
         self._validator = _create_validator()
-        self.__find_all_providers(providers.__path__)
+        # Local source folders are loaded first. They should take precedence over the package ones for
+        # Development purpose. In production provider.yaml files are not present in the 'airflow" directory
+        # So there is no risk we are going to override package provider accidentally. This can only happen
+        # in case of local development
+        self._discover_all_airflow_builtin_providers_from_local_sources()
+        self._discover_all_providers_from_packages()
+        self._discover_hooks()
+        self._sort_provider_dictionary()
+        self._sort_hooks_dictionary()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    def _sort_hooks_dictionary(self):
+        """
+        Creates provider_directory as sorted (by package_name) OrderedDict.
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
+        Duplicates are removed from "package" providers in case corresponding "folder" provider is found.
+        The "folder" providers are from local sources (packages do not contain provider.yaml files),
+        so if someone has airflow installed from local sources, the providers are imported from there
+        first so, provider information should be taken from there.
+        :return:
+        """
+        sorted_dict = OrderedDict()
+        for connection_type in sorted(self._hooks_dict.keys()):
+            sorted_dict[connection_type] = self._hooks_dict[connection_type]
+        self._hooks_dict = sorted_dict
+
+    def _sort_provider_dictionary(self):
+        """
+        Sort provider_dictionary using OrderedDict.
+
+        The dictionary gets sorted so that when you iterate through it, the providers are by
+        default returned in alphabetical order.
+        """
+        sorted_dict = OrderedDict()
+        for provider_name in sorted(self._provider_dict.keys()):
+            sorted_dict[provider_name] = self._provider_dict[provider_name]
+        self._provider_dict = sorted_dict
+
+    def _discover_all_providers_from_packages(self) -> None:
+        """
+        Discovers all providers by scanning packages installed. The list of providers should be returned
+        via the 'apache_airflow_provider' entrypoint as a dictionary conforming to the
+        'airflow/provider.yaml.schema.json' schema.
+        """
+        for entry_point in pkg_resources.iter_entry_points('apache_airflow_provider'):
+            package_name = entry_point.dist.project_name
+            log.debug("Loading %s from package %s", entry_point, package_name)
+            version = entry_point.dist.version
             try:
-                imported_module = importlib.import_module(module_info.name)
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when importing %s:%s", module_info.name, e)
+                provider_info = entry_point.load()()
+            except pkg_resources.VersionConflict as e:
+                log.warning(
+                    "The provider package %s could not be registered because of version conflict : %s",
+                    package_name,
+                    e,
+                )
                 continue
-            try:
-                provider = importlib_resources.read_text(imported_module, 'provider.yaml')
-                provider_info = yaml.safe_load(provider)
-                self._validator.validate(provider_info)
-                self._provider_directory[provider_info['package-name']] = provider_info
-            except FileNotFoundError:
-                # This is OK - this is not a provider package
-                pass
-            except TypeError as e:
-                if "is not a package" not in str(e):
-                    log.warning("Error when loading 'provider.yaml' file from %s:%s}", module_info.name, e)
-                # Otherwise this is OK - this is likely a module
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when loading 'provider.yaml' file from %s:%s", module_info.name, e)
+            self._validator.validate(provider_info)
+            provider_info_package_name = provider_info['package-name']
+            if package_name != provider_info_package_name:
+                raise Exception(
+                    f"The package '{package_name}' from setuptools and "
+                    f"{provider_info_package_name} do not match. Please make sure they are"
+                    f"aligned"

Review comment:
       ```suggestion
                       "aligned"
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r528750455



##########
File path: airflow/providers/apache/hive/hooks/hive.py
##########
@@ -78,9 +78,13 @@ class HiveCliHook(BaseHook):
     :type  mapred_job_name: str
     """
 
+    conn_name_attr = 'hive_cli_conn_id'
+    default_conn_name = 'hive_cli_default'
+    conn_type = 'hive_cli'
+
     def __init__(
         self,
-        hive_cli_conn_id: str = "hive_cli_default",
+        hive_cli_conn_id: str = default_conn_name,

Review comment:
       I want to do a minimum set of changes for this discovery to pass; we can definitely improve it in the future. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-734712855


   Fixed @kaxil 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735349806


   > Yandex
   > ssh
   > slack
   > singularity
   > sftp
   > sendgrid
   > segment
   > samba
   
   None of these providers are supported by DbApiHook, so I don't think they are needed for now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk merged pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #12466:
URL: https://github.com/apache/airflow/pull/12466


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531483119



##########
File path: airflow/providers/microsoft/mssql/hooks/mssql.py
##########
@@ -70,3 +71,9 @@ def set_autocommit(
 
     def get_autocommit(self, conn: pymssql.connect):  # pylint: disable=c-extension-no-member
         return conn.autocommit_state
+
+
+class MsSqlHookWithOdbcHook(OdbcHook):
+    """Interact with Microsoft SQL Server via ODBC."""
+
+    conn_type = 'mssql'

Review comment:
       Don't we need 
   
   
   ```suggestion
       conn_type = 'mssql'
       conn_name_attr = 'mssql_conn_id'
       default_conn_name = 'mssql_default'
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735268622


   [The Workflow run](https://github.com/apache/airflow/actions/runs/388899377) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735374051


   > > Some of the existing ones are also not DbApiHook either - docker, gcpssh, grpc for example.
   > 
   > We didn't have clear rules on when to add classes, so sometimes classes were added but then never used. Actually in Airflow code, this is only used by operators that use DbApiHook. I did not observe any other use cases.
   
   Yeah, right now it's not. We've been talking about populating the Connections form in the UI with this information, at which point we'll have to add everything. But you are both right in saying one thing at a time.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r528730331



##########
File path: airflow/providers/apache/hive/hooks/hive.py
##########
@@ -78,9 +78,13 @@ class HiveCliHook(BaseHook):
     :type  mapred_job_name: str
     """
 
+    conn_name_attr = 'hive_cli_conn_id'

Review comment:
       What was your thinking for `conn_name_attr` instead of `CONN_NAME_ATTR`?
   
   (My default would be to use the latter, as it's a common signifier that it's a constant.)

##########
File path: airflow/providers/apache/hive/hooks/hive.py
##########
@@ -78,9 +78,13 @@ class HiveCliHook(BaseHook):
     :type  mapred_job_name: str
     """
 
+    conn_name_attr = 'hive_cli_conn_id'
+    default_conn_name = 'hive_cli_default'
+    conn_type = 'hive_cli'
+
     def __init__(
         self,
-        hive_cli_conn_id: str = "hive_cli_default",
+        hive_cli_conn_id: str = default_conn_name,

Review comment:
       We could, if we wanted, use `inspect.signature` to get the default value for the hook:
   
   ```ipython
   In [3]: from airflow.providers.apache.hive.hooks.hive import HiveCliHook                                                                                                                                                                                                                                    
   
   In [4]: import inspect                                                                                                                                                                                                                                                                                      
   
   In [5]: inspect.signature(HiveCliHook).parameters['hive_cli_conn_id'].default                                                                                                                                                                                                                               
   Out[5]: 'hive_cli_default'
   ```
   
   Not needed, and might be overkill, but is just one less thing to need in the "interface"




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532082757



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {
+      "type": "array",
+      "items": {
+        "type": "string",
+        "description": "Hooks class names that provide connection types to core"

Review comment:
       OK. I went ahead and fixed a few typos and added some missing descriptions in #12690, as the behaviour and descriptions were not consistent in the schema.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532193431



##########
File path: airflow/providers/google/cloud/hooks/bigquery.py
##########
@@ -67,7 +67,9 @@
 class BigQueryHook(GoogleBaseHook, DbApiHook):
     """Interact with BigQuery. This hook uses the Google Cloud connection."""
 
-    conn_name_attr = 'gcp_conn_id'  # type: str
+    conn_name_attr = 'gcp_conn_id'
+    default_conn_name = 'google_cloud_default'
+    conn_type = 'google_cloud_platform'

Review comment:
       Related code: 
   https://github.com/apache/airflow/blob/a6cf883513280de2df80989b81eb65065ec761d1/airflow/providers_manager.py#L199-L209




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-732202460


   > Should we define these properties as required on BaseHook
   
   Will this work if one type of connection doesn't always mean one hook? For GCP, we have many hooks but only 3 connection types.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532091394



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {

Review comment:
       We can easily do it as a follow-up change :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r528753861



##########
File path: airflow/providers/apache/hive/hooks/hive.py
##########
@@ -78,9 +78,13 @@ class HiveCliHook(BaseHook):
     :type  mapred_job_name: str
     """
 
+    conn_name_attr = 'hive_cli_conn_id'
+    default_conn_name = 'hive_cli_default'
+    conn_type = 'hive_cli'
+
     def __init__(
         self,
-        hive_cli_conn_id: str = "hive_cli_default",
+        hive_cli_conn_id: str = default_conn_name,

Review comment:
       I really want to avoid creating a new "Hook" interface now. My goal is to have provider hook discovery for now. Adding, standardizing, documenting and polishing a "common Hook API" should be the next step IMHO - possibly at the time when we introduce reading Hooks via a "plugin-like" interface.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531548998



##########
File path: airflow/models/connection.py
##########
@@ -30,69 +31,20 @@
 from airflow.exceptions import AirflowException, AirflowNotFoundException
 from airflow.models.base import ID_LEN, Base
 from airflow.models.crypto import get_fernet
+from airflow.providers_manager import ProvidersManager
 from airflow.utils.log.logging_mixin import LoggingMixin
 from airflow.utils.module_loading import import_string
 
-# A map that assigns a connection type to a tuple that contains
-# the path of the class and the name of the conn_id key parameter.
-# PLEASE KEEP BELOW LIST IN ALPHABETICAL ORDER.
-CONN_TYPE_TO_HOOK = {
-    "azure_batch": (
-        "airflow.providers.microsoft.azure.hooks.azure_batch.AzureBatchHook",
-        "azure_batch_conn_id",
-    ),
-    "azure_cosmos": (
-        "airflow.providers.microsoft.azure.hooks.azure_cosmos.AzureCosmosDBHook",
-        "azure_cosmos_conn_id",
-    ),
-    "azure_data_lake": (
-        "airflow.providers.microsoft.azure.hooks.azure_data_lake.AzureDataLakeHook",
-        "azure_data_lake_conn_id",
-    ),
-    "cassandra": ("airflow.providers.apache.cassandra.hooks.cassandra.CassandraHook", "cassandra_conn_id"),
-    "cloudant": ("airflow.providers.cloudant.hooks.cloudant.CloudantHook", "cloudant_conn_id"),
-    "dataprep": ("airflow.providers.google.cloud.hooks.dataprep.GoogleDataprepHook", "dataprep_default"),
-    "docker": ("airflow.providers.docker.hooks.docker.DockerHook", "docker_conn_id"),
-    "elasticsearch": (
-        "airflow.providers.elasticsearch.hooks.elasticsearch.ElasticsearchHook",
-        "elasticsearch_conn_id",
-    ),
-    "exasol": ("airflow.providers.exasol.hooks.exasol.ExasolHook", "exasol_conn_id"),
-    "gcpcloudsql": (
-        "airflow.providers.google.cloud.hooks.cloud_sql.CloudSQLDatabaseHook",
-        "gcp_cloudsql_conn_id",
-    ),
-    "gcpssh": (
-        "airflow.providers.google.cloud.hooks.compute_ssh.ComputeEngineSSHHook",
-        "gcp_conn_id",
-    ),
-    "google_cloud_platform": (
-        "airflow.providers.google.cloud.hooks.bigquery.BigQueryHook",
-        "bigquery_conn_id",
-    ),
-    "grpc": ("airflow.providers.grpc.hooks.grpc.GrpcHook", "grpc_conn_id"),
-    "hive_cli": ("airflow.providers.apache.hive.hooks.hive.HiveCliHook", "hive_cli_conn_id"),
-    "hiveserver2": ("airflow.providers.apache.hive.hooks.hive.HiveServer2Hook", "hiveserver2_conn_id"),
-    "imap": ("airflow.providers.imap.hooks.imap.ImapHook", "imap_conn_id"),
-    "jdbc": ("airflow.providers.jdbc.hooks.jdbc.JdbcHook", "jdbc_conn_id"),
-    "jira": ("airflow.providers.jira.hooks.jira.JiraHook", "jira_conn_id"),
-    "kubernetes": ("airflow.providers.cncf.kubernetes.hooks.kubernetes.KubernetesHook", "kubernetes_conn_id"),
-    "mongo": ("airflow.providers.mongo.hooks.mongo.MongoHook", "conn_id"),
-    "mssql": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "mysql": ("airflow.providers.mysql.hooks.mysql.MySqlHook", "mysql_conn_id"),
-    "odbc": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "oracle": ("airflow.providers.oracle.hooks.oracle.OracleHook", "oracle_conn_id"),
-    "pig_cli": ("airflow.providers.apache.pig.hooks.pig.PigCliHook", "pig_cli_conn_id"),
-    "postgres": ("airflow.providers.postgres.hooks.postgres.PostgresHook", "postgres_conn_id"),
-    "presto": ("airflow.providers.presto.hooks.presto.PrestoHook", "presto_conn_id"),
-    "redis": ("airflow.providers.redis.hooks.redis.RedisHook", "redis_conn_id"),
-    "snowflake": ("airflow.providers.snowflake.hooks.snowflake.SnowflakeHook", "snowflake_conn_id"),
-    "sqlite": ("airflow.providers.sqlite.hooks.sqlite.SqliteHook", "sqlite_conn_id"),
-    "tableau": ("airflow.providers.salesforce.hooks.tableau.TableauHook", "tableau_conn_id"),
-    "vertica": ("airflow.providers.vertica.hooks.vertica.VerticaHook", "vertica_conn_id"),
-    "wasb": ("airflow.providers.microsoft.azure.hooks.wasb.WasbHook", "wasb_conn_id"),
-}
-# PLEASE KEEP ABOVE LIST IN ALPHABETICAL ORDER.
+
+@lru_cache(maxsize=None)
+def get_conn_type_to_hook() -> Dict[str, Tuple[str, str]]:

Review comment:
       Not any more - I think it's a remnant from previous versions where hooks were retrieved on-demand as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735364763


   > > Yandex
   > > ssh
   > > slack
   > > singularity
   > > sftp
   > > sendgrid
   > > segment
   > > samba
   > 
   > None of these providers are supported by DbApiHook. Do we need to define it then?
   
   Some of the existing ones are also not DbApiHook either - docker, gcpssh, grpc for example.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531566639



##########
File path: dev/provider_packages/SETUP_TEMPLATE.py.jinja2
##########
@@ -59,19 +59,6 @@ def do_setup(version_suffix_for_pypi=''):
         packages=find_namespace_packages(
             include=['airflow.providers.{{ PROVIDER_PACKAGE_ID }}',
                      'airflow.providers.{{ PROVIDER_PACKAGE_ID }}.*']),
-        package_data={ '': [
-{% if PROVIDER_PACKAGE_ID == 'amazon' %}
-          "*.json",
-{% elif PROVIDER_PACKAGE_ID == 'cncf.kubernetes' %}
-          "*.yaml",
-{% elif PROVIDER_PACKAGE_ID == 'google' %}
-          "*.yaml",
-          "*.sql",
-{% elif PROVIDER_PACKAGE_ID == 'papermill' %}
-          "*.ipynb",
-{% endif %}
-            ],
-        },

Review comment:
       Yep - it seems to work with just MANIFEST.in:
   
   ```
   jarek:~/code/airflow] [test-pip-check] add-connection-discovery-to-providers+ 1 ± tar -tvzf dist/apache-airflow-providers-google-1.0.0b2.tar.gz| grep yaml
   -rw-r--r-- root/root       860 2020-11-26 14:08 apache-airflow-providers-google-1.0.0b2/airflow/providers/google/cloud/example_dags/example_cloud_build.yaml
   ```
   
   ```
   [jarek:~/code/airflow] [test-pip-check] add-connection-discovery-to-providers+ ± unzip -t dist/apache_airflow_providers_google-1.0.0b2-py3-none-any.whl | grep yaml
       testing: airflow/providers/google/cloud/example_dags/example_cloud_build.yaml   OK
   ```
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735370364


   [The Workflow run](https://github.com/apache/airflow/actions/runs/389762806) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532193500



##########
File path: airflow/plugins_manager.py
##########
@@ -30,6 +30,7 @@
 import importlib_metadata
 
 from airflow import settings
+from airflow.utils.entry_points_with_dist import entry_points_with_dist

Review comment:
       But I think "entry_point_with_dist" is a better name. We can use it also for non-plugins (providers_manager is technically not a plugin in the sense of an "Airflow Plugin"). This is a nice little utility function that can be used for anything that wants to quickly retrieve entry points, not necessarily plugins.
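   
   For context, a rough sketch of what such a helper could look like, assuming it is built on `importlib_metadata` (which plugins_manager already imports); the actual implementation in the PR may differ:
   
   ```python
   import importlib_metadata
   
   
   def entry_points_with_dist(group: str):
       """Yield (entry_point, distribution) pairs for a given entry point group."""
       for dist in importlib_metadata.distributions():
           for entry_point in dist.entry_points:
               if entry_point.group != group:
                   continue
               yield entry_point, dist
   
   
   # e.g. list provider packages together with their distribution metadata:
   for ep, dist in entry_points_with_dist('apache_airflow_provider'):
       print(dist.metadata['Name'], ep.name)
   ```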




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735374917


   > Did something really detect that as a cycle? Cos importing `connection.py` would not cause the provider_manager to import anything from plugins_manager -- only running code would do that, not simply importing. Therefore that is not a cyclic import.
   
   Yes. Pylint did, and failed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r526511344



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -17,6 +17,11 @@
         "type": "string"
       }
     },
+    "provider-package": {

Review comment:
       @ashb @mik-laj -> this is also in preparation for being able to use 3rd-party providers. I've added provider-package to the provider.yaml to avoid repetition and made all modules/classes relative. This way it is really safe - because we will not be able to accidentally point to a class outside of our own package.
   
   I also added a check that the provider package is properly named, but only when it starts with "airflow.providers". 
   
   Currently we only support airflow.providers anyway. 
   
   It might be good to discuss this now and agree on future conventions for those 3rd parties eventually
   
   * 1) are we allowing any package to contain a provider?
   
   If not:
   
   * 2a) should all providers live in airflow.providers
   * 2b) should they live in airflow.contrib.providers (or similar)
   
   if 2a) or 2b) -> should the package names follow the convention:
   
   ```
      apache-airflow-providers-<PROVIDER> (with '.' replaced by -). 
   ```
   
   
   OR maybe some other approach? WDYT?
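   
   As a side note on the relative class paths mentioned above, the resolution is straightforward - a sketch mirroring the `get_full_class_name` helper that appears elsewhere in this thread:
   
   ```python
   def get_full_class_name(provider_package: str, class_path: str) -> str:
       # A path starting with '.' is resolved against the provider package,
       # so a provider cannot accidentally point outside of its own package.
       if class_path.startswith('.'):
           return f"{provider_package}{class_path}"
       return class_path
   
   
   get_full_class_name("airflow.providers.google", ".cloud.hooks.bigquery.BigQueryHook")
   # -> 'airflow.providers.google.cloud.hooks.bigquery.BigQueryHook'
   ```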
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r531449484



##########
File path: airflow/models/connection.py
##########
@@ -30,69 +31,22 @@
 from airflow.exceptions import AirflowException, AirflowNotFoundException
 from airflow.models.base import ID_LEN, Base
 from airflow.models.crypto import get_fernet
+from airflow.providers_manager import ProvidersManager
 from airflow.utils.log.logging_mixin import LoggingMixin
 from airflow.utils.module_loading import import_string
 
-# A map that assigns a connection type to a tuple that contains
-# the path of the class and the name of the conn_id key parameter.
-# PLEASE KEEP BELOW LIST IN ALPHABETICAL ORDER.
-CONN_TYPE_TO_HOOK = {
-    "azure_batch": (
-        "airflow.providers.microsoft.azure.hooks.azure_batch.AzureBatchHook",
-        "azure_batch_conn_id",
-    ),
-    "azure_cosmos": (
-        "airflow.providers.microsoft.azure.hooks.azure_cosmos.AzureCosmosDBHook",
-        "azure_cosmos_conn_id",
-    ),
-    "azure_data_lake": (
-        "airflow.providers.microsoft.azure.hooks.azure_data_lake.AzureDataLakeHook",
-        "azure_data_lake_conn_id",
-    ),
-    "cassandra": ("airflow.providers.apache.cassandra.hooks.cassandra.CassandraHook", "cassandra_conn_id"),
-    "cloudant": ("airflow.providers.cloudant.hooks.cloudant.CloudantHook", "cloudant_conn_id"),
-    "dataprep": ("airflow.providers.google.cloud.hooks.dataprep.GoogleDataprepHook", "dataprep_default"),
-    "docker": ("airflow.providers.docker.hooks.docker.DockerHook", "docker_conn_id"),
-    "elasticsearch": (
-        "airflow.providers.elasticsearch.hooks.elasticsearch.ElasticsearchHook",
-        "elasticsearch_conn_id",
-    ),
-    "exasol": ("airflow.providers.exasol.hooks.exasol.ExasolHook", "exasol_conn_id"),
-    "gcpcloudsql": (
-        "airflow.providers.google.cloud.hooks.cloud_sql.CloudSQLDatabaseHook",
-        "gcp_cloudsql_conn_id",
-    ),
-    "gcpssh": (
-        "airflow.providers.google.cloud.hooks.compute_ssh.ComputeEngineSSHHook",
-        "gcp_conn_id",
-    ),
-    "google_cloud_platform": (
-        "airflow.providers.google.cloud.hooks.bigquery.BigQueryHook",
-        "bigquery_conn_id",
-    ),
-    "grpc": ("airflow.providers.grpc.hooks.grpc.GrpcHook", "grpc_conn_id"),
-    "hive_cli": ("airflow.providers.apache.hive.hooks.hive.HiveCliHook", "hive_cli_conn_id"),
-    "hiveserver2": ("airflow.providers.apache.hive.hooks.hive.HiveServer2Hook", "hiveserver2_conn_id"),
-    "imap": ("airflow.providers.imap.hooks.imap.ImapHook", "imap_conn_id"),
-    "jdbc": ("airflow.providers.jdbc.hooks.jdbc.JdbcHook", "jdbc_conn_id"),
-    "jira": ("airflow.providers.jira.hooks.jira.JiraHook", "jira_conn_id"),
-    "kubernetes": ("airflow.providers.cncf.kubernetes.hooks.kubernetes.KubernetesHook", "kubernetes_conn_id"),
-    "mongo": ("airflow.providers.mongo.hooks.mongo.MongoHook", "conn_id"),
-    "mssql": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "mysql": ("airflow.providers.mysql.hooks.mysql.MySqlHook", "mysql_conn_id"),
-    "odbc": ("airflow.providers.odbc.hooks.odbc.OdbcHook", "odbc_conn_id"),
-    "oracle": ("airflow.providers.oracle.hooks.oracle.OracleHook", "oracle_conn_id"),
-    "pig_cli": ("airflow.providers.apache.pig.hooks.pig.PigCliHook", "pig_cli_conn_id"),
-    "postgres": ("airflow.providers.postgres.hooks.postgres.PostgresHook", "postgres_conn_id"),
-    "presto": ("airflow.providers.presto.hooks.presto.PrestoHook", "presto_conn_id"),
-    "redis": ("airflow.providers.redis.hooks.redis.RedisHook", "redis_conn_id"),
-    "snowflake": ("airflow.providers.snowflake.hooks.snowflake.SnowflakeHook", "snowflake_conn_id"),
-    "sqlite": ("airflow.providers.sqlite.hooks.sqlite.SqliteHook", "sqlite_conn_id"),
-    "tableau": ("airflow.providers.salesforce.hooks.tableau.TableauHook", "tableau_conn_id"),
-    "vertica": ("airflow.providers.vertica.hooks.vertica.VerticaHook", "vertica_conn_id"),
-    "wasb": ("airflow.providers.microsoft.azure.hooks.wasb.WasbHook", "wasb_conn_id"),
-}
-# PLEASE KEEP ABOVE LIST IN ALPHABETICAL ORDER.
+cache = lru_cache(maxsize=None)
+
+
+@cache

Review comment:
       Yep




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-732205817


   > Will this work if one type of connection doesn't always mean one hook? For GCP, we have many hooks but only 3 connection types.
   
   Yeah. I want to check where we are with all the Hooks and see if we have some that are problematic for that first. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r527277145



##########
File path: airflow/providers_manager.py
##########
@@ -46,42 +45,134 @@ class ProvidersManager:
     """Manages all provider packages."""
 
     def __init__(self):
+        self.__log = logging.getLogger(__name__)
         self._provider_directory = {}
         try:
             from airflow import providers
         except ImportError as e:
-            log.warning("No providers are present or error when importing them! :%s", e)
+            self.__log.warning("No providers are present or error when importing them! :%s", e)
             return
-        self._validator = _create_validator()
+        self.__connections: Dict[str, Tuple[str, str]] = {}
+        self.__validator = _create_validator()
         self.__find_all_providers(providers.__path__)
+        self.__retrieve_connections()
 
-    def __find_all_providers(self, paths: str):
-        def onerror(_):
-            exception_string = traceback.format_exc()
-            log.warning(exception_string)
+    @staticmethod
+    def get_full_class_name(provider_package: str, class_path: str):
+        """
+        Return full class name - if the class starts with ., it adds package_name in front.
+        :param provider_package: Package name
+        :param class_path: class path (might be relative if starts with .)
+        :return: full class name
+        """
+        if class_path.startswith('.'):
+            return f"{provider_package}{class_path}"
+        return class_path
 
-        for module_info in pkgutil.walk_packages(paths, prefix="airflow.providers.", onerror=onerror):
-            try:
-                imported_module = importlib.import_module(module_info.name)
-            except Exception as e:  # noqa pylint: disable=broad-except
-                log.warning("Error when importing %s:%s", module_info.name, e)
-                continue
+    def __find_all_providers(self, paths: str) -> None:
+        """
+        Finds all providers installed and stores them in the directory of the providers.
+        :param paths: Path where to look for providers
+        """
+        import warnings
+
+        old_warn = warnings.warn
+
+        def warn(*args, **kwargs):
+            if "deprecated" not in args[0]:
+                old_warn(*args, **kwargs)
+
+        warnings.warn = warn

Review comment:
       Ah nice!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-731958275


   @ashb -> I think I realized what you wanted to achieve, and I looked at the other connections we had, and it turned out that we nearly had a standard for those class attributes that I just had to make sure to synchronize with. Indeed this will make it easier in the future to implement "basic" plugin-like hook behavior. It's not yet implemented (and it is out-of-scope for this change) but there is nothing to prevent adding such plugin behaviour in the near future.
   
   I hope we have something that we are both going to be happy with.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532084053



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -125,6 +127,7 @@
             "description": "Integration name. It must have a matching item in the 'integration' section of any provider."
           },
           "python-modules": {
+            "description": "Name of the module.",

Review comment:
       I fixed all the other similar typos/grammar/missing descriptions in #12690 in the original schema, as the usage was not really consistent originally. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj edited a comment on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj edited a comment on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735349806


   > Yandex
   > ssh
   > slack
   > singularity
   > sftp
   > sendgrid
   > segment
   > samba
   
   None of these providers are supported by DbApiHook. Do we need to define it then?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb edited a comment on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
ashb edited a comment on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735373831


   > Just one last fixup - we had cyclic dependency Connection -> provider_manager -> plugins_manager -> Hook -> Connections. Solved it by extracting entrypoint_with_dist to a separate util module (as this is an util used by both plugins_manager and providers_manager now).
   
   Did something really detect that as a cycle? Cos importing `connection.py` would not cause the provider_manager to import anything from plugins_manager -- only running code would do that, not simply importing. Therefore that is not a cyclic import.
   
   (I'm okay with moving it to a separate module anyway, I was just being a little bit lazy last night when adding it. I'm still surprised that this was claimed to be a cycle though.)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-735349055


   @potiuk I agree with Jarek. Let's focus on making the necessary changes to the core. Additional improvements in providers we can make regardless of core changes. We can now easily update them independently from the core, so it will be a very nice task as a good-first-issue. I imagine each provider will have its operators, connection configurations, and other components described, if it has them.
   A good example, but not a perfect one, is the [`apache-airflow-providers-google`](http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow-providers-google/latest/index.html) package.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-732205347


   > Should we define these properties as required on BaseHook?
   
   We already have different hierarchies of hooks. We may well do that later if we find out that we have all the hooks covered. But we can also add it as a follow-up.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #12466: Adds support for Connection/Hook discovery from providers

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#discussion_r532060641



##########
File path: airflow/provider.yaml.schema.json
##########
@@ -167,6 +170,13 @@
           "python-module"
         ]
       }
+    },
+    "hook-class-names": {
+      "type": "array",
+      "items": {
+        "type": "string",
+        "description": "Hooks class names that provide connection types to core"

Review comment:
       ```suggestion
         "type": "array",
         "description": "Hooks class names that provide connection types to core"
         "items": {
           "type": "string",
   ```
   If we describe arrays, we can use the plural in the sentence. If we describe an item, we should use singular form in the description.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #12466: Adds support for Connection discovery from providers

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #12466:
URL: https://github.com/apache/airflow/pull/12466#issuecomment-731821859


   Hey @ashb -> that is another case for the "extra links" -> this one is simpler because we really just need the class names and nothing more. And I believe this is similar to what you wanted to do for Connections via #12466 - but let's talk about it tomorrow


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org