Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/12/14 20:35:02 UTC

[GitHub] [airflow] marshall7m opened a new pull request #13072: AWS Glue Crawler Integration

marshall7m opened a new pull request #13072:
URL: https://github.com/apache/airflow/pull/13072


   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   ## Description
   
   This PR adds an AWS Glue crawler operator and hook that can be used to trigger Glue crawlers from Airflow.
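   
   A minimal usage sketch (the DAG id, task id, role, database, and S3 path are illustrative placeholders, not values from this PR):
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.providers.amazon.aws.operators.glue_crawler import AwsGlueCrawlerOperator
   
   with DAG(dag_id="example_glue_crawler", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
       crawl_s3 = AwsGlueCrawlerOperator(
           task_id="crawl_s3",
           # config keys mirror the boto3 Glue create_crawler()/update_crawler() API
           config={
               "Name": "example-crawler",
               "Role": "example-glue-role",
               "DatabaseName": "example_db",
               "Targets": {"S3Targets": [{"Path": "s3://example-bucket/data/"}]},
           },
       )
   ```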
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] feluelle commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
feluelle commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-762807120


   @dstandish @mik-laj any objections before I merge?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r551508779



##########
File path: airflow/providers/amazon/aws/operators/glue_crawler.py
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from cached_property import cached_property
+
+from airflow.models import BaseOperator
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerOperator(BaseOperator):
+    """
+    Creates, updates and triggers an AWS Glue Crawler. AWS Glue Crawler is a serverless
+    service that manages a catalog of metadata tables that contain the inferred
+    schema, format and data types of data stores within the AWS cloud.
+
+    :param config: Configurations for the AWS Glue crawler
+    :type config: dict
+    :param aws_conn_id: aws connection to use
+    :type aws_conn_id: Optional[str]
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: Optional[int]
+    """
+
+    ui_color = '#ededed'
+
+    @apply_defaults
+    def __init__(
+        self,
+        config,
+        aws_conn_id='aws_default',
+        poll_interval: int = 5,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.aws_conn_id = aws_conn_id
+        self.poll_interval = poll_interval
+        self.config = config
+
+    @cached_property
+    def hook(self) -> AwsGlueCrawlerHook:
+        """Create and return an AwsGlueCrawlerHook."""
+        return AwsGlueCrawlerHook(self.aws_conn_id)
+
+    def execute(self, context):
+        """
+        Executes AWS Glue Crawler from Airflow
+        :return: the name of the current glue crawler.
+        """
+        crawler_name = self.hook.get_or_create_crawler(**self.config)
+        self.log.info("Triggering AWS Glue Crawler")
+        self.hook.start_crawler(crawler_name)
+        self.log.info("Waiting for AWS Glue Crawler")
+        self.hook.wait_for_crawler_completion(crawler_name)

Review comment:
       Thanks! Currently working on it, should have a revision up in 10 mins.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r543723578



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,282 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List, Optional
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param crawler_name = Unique crawler name per AWS Account
+    :type crawler_name = Optional[str]
+    :param crawler_desc = Crawler description
+    :type crawler_desc = Optional[str]
+    :param glue_db_name = AWS glue catalog database ID
+    :type glue_db_name = Optional[str]
+    :param iam_role_name = AWS IAM role for glue crawler
+    :type iam_role_name = Optional[str]
+    :param region_name = AWS region name (e.g. 'us-west-2')
+    :type region_name = Optional[str]
+    :param s3_target_configuration = Configurations for crawling AWS S3 paths
+    :type s3_target_configuration = Optional[list]
+    :param jdbc_target_configuration = Configurations for crawling JDBC paths
+    :type jdbc_target_configuration = Optional[list]
+    :param mongo_target_configuration = Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration = Optional[list]
+    :param dynamo_target_configuration = Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration = Optional[list]
+    :param glue_catalog_target_configuration = Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration = Optional[list]
+    :param cron_schedule = Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule = Optional[str]
+    :param classifiers = List of user defined custom classifiers to be used by the crawler
+    :type classifiers = Optional[list]
+    :param table_prefix = Prefix for catalog table to be created
+    :type table_prefix = Optional[str]
+    :param update_behavior = Behavior when the crawler identifies schema changes
+    :type update_behavior = Optional[str]
+    :param delete_behavior = Behavior when the crawler identifies deleted objects
+    :type delete_behavior = Optional[str]
+    :param recrawl_behavior = Behavior when the crawler needs to crawl again
+    :type recrawl_behavior = Optional[str]
+    :param lineage_settings = Enables or disables data lineage
+    :type lineage_settings = Optional[str]
+    :param json_configuration = Versioned JSON configuration for the crawler
+    :type json_configuration = Optional[str]
+    :param security_configuration = Name of the security configuration structure to be used by the crawler.
+    :type security_configuration = Optional[str]
+    :param tags = Tags to attach to the crawler request
+    :type tags = Optional[dict]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        ":return: Lists of Crawlers"
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        ":return: iam role for crawler execution"
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue to run crawler
+        :return:
+        """
+        glue_client = self.get_conn()
+
+        try:
+            crawler_name = self.get_or_create_glue_crawler()
+            crawler_run = glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:
+            self.log.error("Failed to run aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+        :param crawler_name: unique crawler name per AWS account

Review comment:
       ```suggestion
   
           :param crawler_name: unique crawler name per AWS account
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r543722769



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,282 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List, Optional
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param crawler_name = Unique crawler name per AWS Account

Review comment:
       ```suggestion
   
       :param crawler_name = Unique crawler name per AWS Account
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-762816746


   The PR is likely OK to be merged with just a subset of tests for default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest master or amend the last commit of the PR, and push it with --force-with-lease.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r556716347



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if the crawler doesn't exist and returns the crawler name
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler %s", crawler_name)
+            try:
+                glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+                return glue_response['Crawler']['Name']
+            except self.glue_client.exceptions.InvalidInputException as general_error:
+                self.check_iam_role(crawler_kwargs['Role'])
+                raise AirflowException(general_error)

Review comment:
       That's a great idea! It seems like it will remove a future pain point with the current `get_or_create_crawler()` function. Specifically, the scenario where a user calls the function just to check whether the crawler exists, but makes a typo in the crawler_name and ends up creating a new crawler with the same configurations under a slightly different name.
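   
       A sketch of that failure mode for illustration (crawler and role names are made up; `has_crawler` refers to the split API in the later revision of this PR):
   
       ```python
       from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
   
       hook = AwsGlueCrawlerHook(aws_conn_id="aws_default")
       config = {
           "Name": "sales-crawlr",  # typo: the existing crawler is "sales-crawler"
           "Role": "sales-glue-role",
           "Targets": {"S3Targets": [{"Path": "s3://sales-bucket/raw/"}]},
       }
   
       # With get-or-create semantics, the typo silently creates a near-duplicate
       # crawler instead of failing fast:
       hook.get_or_create_crawler(**config)
   
       # An explicit existence check surfaces the typo instead:
       if not hook.has_crawler(config["Name"]):
           raise ValueError(f"no crawler named {config['Name']!r}; refusing to create")
       ```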




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r556687455



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict

Review comment:
       silly me must have somehow mixed up the operator with the hook param




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r543723215



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,282 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List, Optional
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param crawler_name = Unique crawler name per AWS Account
+    :type crawler_name = Optional[str]

Review comment:
       ```suggestion
       :type crawler_name: Optional[str]
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560145904



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs = Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs = any

Review comment:
       ```suggestion
           :param crawler_kwargs: Keyword args that define the configurations used for the crawler
           :type crawler_kwargs: any
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mschmo commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mschmo commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r551636429



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if the crawler doesn't exist and returns the crawler name
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)

Review comment:
       Nice, I overlooked this API method when implementing a custom Glue hook/operator in my own project. Good to know about!
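   
       For reference, a minimal sketch of the boto3 call being discussed; only `Name` is required, the other fields are optional updates (the crawler name and description here are placeholders):
   
       ```python
       import boto3
   
       glue = boto3.client("glue")
       # update_crawler() accepts the same keyword shape as create_crawler();
       # Name identifies the crawler to modify.
       glue.update_crawler(Name="example-crawler", Description="refreshed description")
       ```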




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-745618250


   [The Workflow run](https://github.com/apache/airflow/actions/runs/424361603) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r543723773



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,282 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List, Optional
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param crawler_name = Unique crawler name per AWS Account
+    :type crawler_name = Optional[str]
+    :param crawler_desc = Crawler description
+    :type crawler_desc = Optional[str]
+    :param glue_db_name = AWS glue catalog database ID
+    :type glue_db_name = Optional[str]
+    :param iam_role_name = AWS IAM role for glue crawler
+    :type iam_role_name = Optional[str]
+    :param region_name = AWS region name (e.g. 'us-west-2')
+    :type region_name = Optional[str]
+    :param s3_target_configuration = Configurations for crawling AWS S3 paths
+    :type s3_target_configuration = Optional[list]
+    :param jdbc_target_configuration = Configurations for crawling JDBC paths
+    :type jdbc_target_configuration = Optional[list]
+    :param mongo_target_configuration = Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration = Optional[list]
+    :param dynamo_target_configuration = Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration = Optional[list]
+    :param glue_catalog_target_configuration = Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration = Optional[list]
+    :param cron_schedule = Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule = Optional[str]
+    :param classifiers = List of user defined custom classifiers to be used by the crawler
+    :type classifiers = Optional[list]
+    :param table_prefix = Prefix for catalog table to be created
+    :type table_prefix = Optional[str]
+    :param update_behavior = Behavior when the crawler identifies schema changes
+    :type update_behavior = Optional[str]
+    :param delete_behavior = Behavior when the crawler identifies deleted objects
+    :type delete_behavior = Optional[str]
+    :param recrawl_behavior = Behavior when the crawler needs to crawl again
+    :type recrawl_behavior = Optional[str]
+    :param lineage_settings = Enables or disables data lineage
+    :type lineage_settings = Optional[str]
+    :param json_configuration = Versioned JSON configuration for the crawler
+    :type json_configuration = Optional[str]
+    :param security_configuration = Name of the security configuration structure to be used by the crawler.
+    :type security_configuration = Optional[str]
+    :param tags = Tags to attach to the crawler request
+    :type tags = Optional[dict]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        ":return: Lists of Crawlers"
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        ":return: iam role for crawler execution"
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue to run crawler
+        :return:
+        """
+        glue_client = self.get_conn()
+
+        try:
+            crawler_name = self.get_or_create_glue_crawler()
+            crawler_run = glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:
+            self.log.error("Failed to run aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_state = crawler_run['Crawler']['State']
+        return crawler_run_state
+
+    def get_crawler_status(self, crawler_name: str) -> str:
+        """
+        Get current status of the Glue crawler. The crawler
+        status can be succeeded, cancelled, or failed.
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Status of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_status = crawler_run['Crawler']['LastCrawl']['Status']
+        return crawler_run_status
+
+    def crawler_completion(self, crawler_name: str) -> str:
+        """
+        Waits until Glue crawler with crawler_name completes or
+        fails and returns final state if finished.
+        Raises AirflowException when the crawler failed
+        :param crawler_name: unique crawler name per AWS account

Review comment:
       ```suggestion
   
           :param crawler_name: unique crawler name per AWS account
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] feluelle commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
feluelle commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560744852



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       > in testing we saw that when InvalidInputException is raised due to role, it mentions the role in the error. so it seems pointless to me.
   
   Got it. If that's really the case then we do not need, and probably should not, check the IAM role again; we save a request :) Checking only makes sense if the error does not mention the role.
   
   (sry Daniel I missed the paragraph 😬 )
   > i would leave it with the original error, which already mentions a problem related to the role when that's the cause of the exception
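   
       A sketch of that idea in the context of `create_crawler()` (the substring test on the error message is a heuristic, not an API guarantee):
   
       ```python
       try:
           self.glue_client.create_crawler(**crawler_kwargs)
       except self.glue_client.exceptions.InvalidInputException as error:
           # Spend the extra IAM request only when boto3's error message
           # does not already point at the role.
           if 'Role' not in str(error):
               self.check_iam_role(crawler_kwargs['Role'])
           raise
       ```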




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r548649373



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,181 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> str:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=role_name)
+            self.log.info("Iam Role Name: %s", role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_or_create_crawler(self, config) -> str:
+        """
+        Creates the crawler if the crawler doesn't exist and returns the crawler name
+
+        :param config = Configurations for the AWS Glue crawler
+        :type config = Optional[dict]
+        :return: Name of the crawler
+        """
+        self.get_iam_execution_role(config["Role"])

Review comment:
       It's more of a check to see if the role exists/is valid. I was using the glue job hook (`glue.py`) as a reference, which uses that method. Similar to what you said in another suggestion, I think the Glue API exception for an invalid crawler IAM role should hopefully be clear enough. I'll check and probably remove this method from the hook.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560319861



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       i am not sure why you would want to check iam role if you're gonna raise anyway?
   why not just let the exception go uncaught?
   otherwise looking good now :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560295521



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        This is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> bool:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and False otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs: Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs: any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])
+            raise AirflowException(general_error)
+
+    def start_crawler(self, crawler_name: str) -> dict:
+        """
+        Triggers the AWS Glue crawler
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Empty dictionary
+        """
+        crawler = self.glue_client.start_crawler(Name=crawler_name)
+        return crawler
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        crawler = self.glue_client.get_crawler(Name=crawler_name)
+        crawler_state = crawler['Crawler']['State']

Review comment:
       Maybe consolidate `get_crawler_status()` and `get_crawler_state()` into one `get_crawler()` function that returns the result of `glue_client.get_crawler()`, and then just index into the response to get the state/status within `wait_for_crawler_completion()`?
   
   something like this:
   
   ```
   while True:
               crawler = self.get_crawler(crawler_name)
               crawler_state = crawler['Crawler']['State']
               if crawler_state == 'READY':
                   self.log.info("State: %s", crawler_state)
                   crawler_status = crawler['Crawler']['LastCrawl']['Status']
                   if crawler_status in failed_status:
                       raise AirflowException(
                           f"Status: {crawler_status}"
                       )  # pylint: disable=raising-format-tuple
                   else:
                       metrics = self.get_crawler_metrics(crawler_name)
                       self.log.info("Status: %s", crawler_status)
                       self.log.info("Last Runtime Duration (seconds): %s", metrics['LastRuntimeSeconds'])
                       self.log.info("Median Runtime Duration (seconds): %s", metrics['MedianRuntimeSeconds'])
                       self.log.info("Tables Created: %s", metrics['TablesCreated'])
                       self.log.info("Tables Updated: %s", metrics['TablesUpdated'])
                       self.log.info("Tables Deleted: %s", metrics['TablesDeleted'])
   
                       return crawler_status
   ```
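
    where `get_crawler()` would just be a thin wrapper over the client call, roughly:

    ```
    def get_crawler(self, crawler_name: str) -> dict:
        """:return: Full crawler description as returned by the Glue API"""
        return self.glue_client.get_crawler(Name=crawler_name)
    ```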




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] feluelle commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
feluelle commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r556726270



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        non-existing role as a role trust policy error.
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if it doesn't exist and returns the crawler name
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler %s", crawler_name)
+            try:
+                glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+                return glue_response['Crawler']['Name']
+            except self.glue_client.exceptions.InvalidInputException as general_error:
+                self.check_iam_role(crawler_kwargs['Role'])
+                raise AirflowException(general_error)

Review comment:
       Sure. Sounds good. :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mschmo commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mschmo commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r550312524



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,159 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval

Review comment:
       IMO `poll_interval` shouldn't be a required argument for the hook. It makes sense as a required argument for the sensor, but the hook I think should be a bit more vague in how it can interact with the Glue API. For instance `poll_interval` is irrelevant if I want to initialize a hook solely to get or create a crawler, and not poll for its status.
   
   I believe it may be better to pass `poll_interval` in as an argument to the `wait_for_crawler_completion` method, which is the only place it's used atm.
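
    Something like this, as a sketch (reusing the `get_crawler_state()` / `get_last_crawl_status()` helpers from the hook above and leaving out the metrics logging):

    ```
    def wait_for_crawler_completion(self, crawler_name: str, poll_interval: int = 5) -> str:
        """Waits until the crawler is READY again and returns the last crawl status"""
        while self.get_crawler_state(crawler_name) != 'READY':
            self.log.info("Polling for AWS Glue crawler: %s", crawler_name)
            sleep(poll_interval)
        return self.get_last_crawl_status(crawler_name)
    ```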

##########
File path: airflow/providers/amazon/aws/sensors/glue_crawler.py
##########
@@ -0,0 +1,51 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.sensors.base import BaseSensorOperator
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerSensor(BaseSensorOperator):
+    """
+    Waits for an AWS Glue crawler to reach any of the statuses below
+    'FAILED', 'CANCELLED', 'SUCCEEDED'

Review comment:
       FYI after the initial run the crawler's last crawl status will **always** be one of these statuses. Do you think it would make more sense to first check that the crawler is in a "READY" state? I feel like this sensor would be much more useful if it behaved similarly to the hook's `wait_for_crawler_completion` and confirmed that the crawler was not currently running.
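
    A rough sketch of that behaviour, assuming the hook keeps the `get_crawler_state()` / `get_last_crawl_status()` helpers from above:

    ```
    def poke(self, context):
        hook = AwsGlueCrawlerHook(aws_conn_id=self.aws_conn_id)
        self.log.info("Poking for AWS Glue crawler: %s", self.crawler_name)
        if hook.get_crawler_state(self.crawler_name) != 'READY':
            # crawler is still running (or stopping); keep poking
            return False
        crawler_status = hook.get_last_crawl_status(self.crawler_name)
        if crawler_status in ('FAILED', 'CANCELLED'):
            raise AirflowException(f"Status: {crawler_status}")
        return True
    ```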

##########
File path: airflow/providers/amazon/aws/sensors/glue_crawler.py
##########
@@ -0,0 +1,51 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.sensors.base import BaseSensorOperator
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerSensor(BaseSensorOperator):
+    """
+    Waits for an AWS Glue crawler to reach any of the statuses below
+    'FAILED', 'CANCELLED', 'SUCCEEDED'
+    :param crawler_name: The AWS Glue crawler unique name
+    :type crawler_name: str
+    """
+
+    @apply_defaults
+    def __init__(self, *, crawler_name: str, aws_conn_id: str = 'aws_default', **kwargs):
+        super().__init__(**kwargs)
+        self.crawler_name = crawler_name
+        self.aws_conn_id = aws_conn_id
+        self.success_statuses = 'SUCCEEDED'
+        self.errored_statuses = ('FAILED', 'CANCELLED')
+
+    def poke(self, context):
+        hook = AwsGlueCrawlerHook(aws_conn_id=self.aws_conn_id, poll_interval=5)

Review comment:
       More reason IMO to remove `poll_interval` from the hook initialization. This is somewhat confusing because it makes it seem like we're hard-coding a 5 second poll interval, but really we're using the base sensor's `poll_interval` attribute between polls.
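
    i.e. once `poll_interval` is gone from the hook's `__init__`, this line becomes simply:

    ```
    hook = AwsGlueCrawlerHook(aws_conn_id=self.aws_conn_id)
    ```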

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,159 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> str:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+        return role_name
+
+    def get_or_create_crawler(self, config) -> str:
+        """
+        Creates the crawler if it doesn't exist and returns the crawler name
+
+        :param config = Configurations for the AWS Glue crawler
+        :type config = dict
+        :return: Name of the crawler
+        """
+        crawler_name = config["Name"]
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**config)
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.get_iam_execution_role(config["Role"])
+            self.log.info("Creating AWS Glue crawler %s", crawler_name)
+            self.glue_client.create_crawler(**config)
+        return crawler_name
+
+    def start_crawler(self, crawler_name: str) -> str:
+        """
+        Triggers the AWS Glue crawler
+        :return: Empty dictionary
+        """
+        crawler = self.glue_client.start_crawler(Name=crawler_name)
+        return crawler
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        crawler = self.glue_client.get_crawler(Name=crawler_name)
+        crawler_state = crawler['Crawler']['State']
+        return crawler_state
+
+    def get_last_crawl_status(self, crawler_name: str) -> str:
+        """
+        Get the status of the latest crawl run. The crawl
+        status can be succeeded, cancelled, or failed.
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Status of the Glue crawler
+        """
+        crawler = self.glue_client.get_crawler(Name=crawler_name)
+        last_crawl_status = crawler['Crawler']['LastCrawl']['Status']
+        return last_crawl_status
+
+    def wait_for_crawler_completion(self, crawler_name: str) -> str:
+        """
+        Waits until Glue crawler completes or
+        fails and returns the final state if finished.
+        Raises AirflowException when the crawler failed
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Dict of crawler's status
+        """
+        failed_status = ['FAILED', 'CANCELLED']
+
+        while True:
+            crawler_state = self.get_crawler_state(crawler_name)
+            if crawler_state == 'READY':
+                self.log.info("State: %s", crawler_state)
+                crawler_status = self.get_last_crawl_status(crawler_name)
+                if crawler_status in failed_status:
+                    raise AirflowException(
+                        f"Status: {crawler_status}"
+                    )  # pylint: disable=raising-format-tuple
+                else:
+                    metrics = self.get_crawler_metrics(crawler_name)
+                    self.log.info("Status: %s", crawler_status)
+                    self.log.info("Last Runtime Duration (seconds): %s", metrics['LastRuntimeSeconds'])
+                    self.log.info("Median Runtime Duration (seconds): %s", metrics['MedianRuntimeSeconds'])
+                    self.log.info("Tables Created: %s", metrics['TablesCreated'])
+                    self.log.info("Tables Updated: %s", metrics['TablesUpdated'])
+                    self.log.info("Tables Deleted: %s", metrics['TablesDeleted'])
+
+                    return crawler_status
+
+            else:
+                self.log.info("Polling for AWS Glue crawler: %s ", crawler_name)
+                self.log.info("State: %s", crawler_state)
+
+                sleep(self.poll_interval)
+
+                metrics = self.get_crawler_metrics(crawler_name)
+                time_left = int(metrics['TimeLeftSeconds'])
+
+                if time_left > 0:
+                    print('Estimated Time Left (seconds): ', time_left)
+                    self.poll_interval = time_left
+                else:
+                    print('Crawler should finish soon')

Review comment:
       Should be using `self.log` instead of `print`.
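
    e.g. something like:

    ```
    self.log.info("Estimated Time Left (seconds): %s", time_left)
    ```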




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-767023736


   Awesome thanks @feluelle for merging. Thank you all for the kind and constructive feedback along the way! @feluelle @mik-laj @dstandish @mschmo


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-744694482


   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, pylint and type annotations). Our [pre-commits]( https://github.com/apache/airflow/blob/master/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature, add useful documentation (in docstrings or in the `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/master/docs/howto/custom-operator.rst). Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/master/BREEZE.rst) for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
   - Be sure to read the [Airflow Coding style]( https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better 🚀.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://s.apache.org/airflow-slack
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r549739066



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,181 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> str:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=role_name)
+            self.log.info("Iam Role Name: %s", role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_or_create_crawler(self, config) -> str:
+        """
+        Creates the crawler if it doesn't exist and returns the crawler name
+
+        :param config = Configurations for the AWS Glue crawler
+        :type config = Optional[dict]
+        :return: Name of the crawler
+        """
+        self.get_iam_execution_role(config["Role"])

Review comment:
       @dstandish I tested the glue API's create_crawler() method with an invalid role and the exception was misleading because it raised: ```botocore.errorfactory.InvalidInputException: An error occurred (InvalidInputException) when calling the CreateCrawler operation: Service is unable to assume role arn:aws:iam::111111111111:role/test-foo-role. Please verify role's TrustPolicy```. The `self.get_iam_execution_role()` call raised a more direct exception: ```botocore.errorfactory.NoSuchEntityException: An error occurred (NoSuchEntity) when calling the GetRole operation: The role with name test-foo-role cannot be found.```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] feluelle commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
feluelle commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r553260972



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict

Review comment:
       ```suggestion
   ```
   ?

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        non-existing role as a role trust policy error.
+        :param role_name = IAM role name

Review comment:
       ```suggestion
           non-existing role as a role trust policy error.
   
           :param role_name = IAM role name
   ```
   Please put an empty line between description and params.

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        non-existing role as a role trust policy error.
+        :param role_name = IAM role name

Review comment:
       And check this on the rest of your code, please. Thanks :)

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        non-existing role as a role trust policy error.
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if it doesn't exist and returns the crawler name
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler %s", crawler_name)
+            try:
+                glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+                return glue_response['Crawler']['Name']
+            except self.glue_client.exceptions.InvalidInputException as general_error:
+                self.check_iam_role(crawler_kwargs['Role'])
+                raise AirflowException(general_error)

Review comment:
       Can you maybe move this into a separate `create_crawler` function?
   
   I would also move the get-or-create logic on the operator level. So that in the execute you would:
   ```python
   if not self.hook.has_crawler(**self.config):
       self.hook.create_crawler(**self.config)
   crawler_name = self.hook.get_crawler(**self.config)
   ```
   where `has_crawler` gets the crawler and handles the exception in case it does not exist i.e. returns False.
   
   WDYT?
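
    A sketch of that `has_crawler`, handling the exception the same way the other hook methods do:

    ```
    def has_crawler(self, crawler_name: str) -> bool:
        """:return: True if a crawler with that name already exists, False otherwise"""
        try:
            self.glue_client.get_crawler(Name=crawler_name)
            return True
        except self.glue_client.exceptions.EntityNotFoundException:
            return False
    ```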




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560489585



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        This is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> bool:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and False otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs: Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs: any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       I remember this discussion.
   
   I would leave it with the original error.
   
   I think adding this only adds confusion.
   
   My 2 cents 🤦‍♂️
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mschmo commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mschmo commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r550317305



##########
File path: airflow/providers/amazon/aws/sensors/glue_crawler.py
##########
@@ -0,0 +1,51 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.sensors.base import BaseSensorOperator
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerSensor(BaseSensorOperator):
+    """
+    Waits for an AWS Glue crawler to reach any of the statuses below
+    'FAILED', 'CANCELLED', 'SUCCEEDED'

Review comment:
       FYI after the initial run the crawler's last crawl status will **always** be one of these statuses. Do you think it would make more sense to first check that the crawler is in a "READY" state? I feel like this sensor would be much more useful if it behaved similarly to the hook's `wait_for_crawler_completion` and confirmed that the crawler was not currently running.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560742156



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        This is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> bool:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and False otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs: Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs: any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       My thought is: why create a method `check_iam_role` that serves no purpose other than throwing an error, when the only reason you're calling it is that you _already know_ the role is bad?
   
   In testing we saw that when InvalidInputException is raised due to the role, it mentions the role in the error, so it seems pointless to me.
   
   But it's also not really going to kill me if this goes in, so I'll quit yapping about it after this :)

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        This is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> bool:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if the crawler was updated and False otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs: Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs: any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            self.glue_client.create_crawler(**crawler_kwargs)
+            return crawler_name
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       my point is: why create a method `check_iam_role` that serves no purpose other than throwing an error, when you only call it because you _already know_ the role is bad?
   
   in testing we saw that when InvalidInputException is raised because of the role, the error message already mentions the role, so the extra check seems pointless to me.
   
   but it's also not really going to kill me if this goes in, so i'll quit yapping about it after this :)
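   
   for illustration, a minimal sketch of what that simplification could look like, reusing the hook's existing `glue_client` property and `AirflowException` import (hypothetical, not the code that was merged):
   
   ```python
   def create_crawler(self, **crawler_kwargs) -> str:
       """Creates an AWS Glue crawler and returns its name."""
       crawler_name = crawler_kwargs['Name']
       try:
           self.glue_client.create_crawler(**crawler_kwargs)
           return crawler_name
       except self.glue_client.exceptions.InvalidInputException as general_error:
           # no separate check_iam_role() call: boto3's InvalidInputException
           # message already names the offending role
           raise AirflowException(general_error)
   ```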




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r548649373



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,181 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config: Configurations for the AWS Glue crawler
+    :type config: Optional[dict]
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> dict:
+        """:return: IAM role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=role_name)
+            self.log.info("Iam Role Name: %s", role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_or_create_crawler(self, config) -> str:
+        """
+        Creates the crawler if it doesn't exist and returns the crawler name
+
+        :param config: Configurations for the AWS Glue crawler
+        :type config: Optional[dict]
+        :return: Name of the crawler
+        """
+        self.get_iam_execution_role(config["Role"])

Review comment:
       It's more of a check to see if the role exists and is valid. I was using the glue job hook (`glue.py`) as a reference, which uses that method. As you said in another suggestion, I think the Glue API should be clear enough to flag when the glue crawler IAM role is invalid. I'll remove this method from the hook.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-761655700






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-752260700


   [The Workflow run](https://github.com/apache/airflow/actions/runs/451575217) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560295521



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> None:
+        """
+        Checks that the given IAM role name refers to a valid,
+        pre-existing role within the caller's AWS account.
+        This check is needed because the current Boto3 (<=1.16.46)
+        Glue client's create_crawler() method misleadingly reports
+        a non-existent role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: None; raises if the role does not exist
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> bool:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if the crawler was updated and False otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs: Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs: any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            self.glue_client.create_crawler(**crawler_kwargs)
+            return crawler_name
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])
+            raise AirflowException(general_error)
+
+    def start_crawler(self, crawler_name: str) -> dict:
+        """
+        Triggers the AWS Glue crawler
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Empty dictionary
+        """
+        crawler = self.glue_client.start_crawler(Name=crawler_name)
+        return crawler
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        crawler = self.glue_client.get_crawler(Name=crawler_name)
+        crawler_state = crawler['Crawler']['State']

Review comment:
       maybe consolidate `get_crawler_status()` and `get_crawler_state()` into one `get_crawler()` function that returns the result of `glue_client.get_crawler()`, and then just index into the response to get the state/status within `wait_for_crawler_completion()`?
   
   something like this within `wait_for_crawler_completion()`:
   
   ```python
   while True:
       crawler = self.get_crawler(crawler_name)
       crawler_state = crawler['Crawler']['State']
       if crawler_state == 'READY':
           self.log.info("State: %s", crawler_state)
           crawler_status = crawler['Crawler']['LastCrawl']['Status']
   ```
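   
   and a sketch of the consolidated accessor that implies (hypothetical, just to make the idea concrete):
   
   ```python
   def get_crawler(self, crawler_name: str) -> dict:
       """Raw get_crawler() response; callers index into it for state or last-crawl status."""
       return self.glue_client.get_crawler(Name=crawler_name)
   ```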




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r548085121



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,294 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param crawler_name: Unique crawler name per AWS Account

Review comment:
       generally i think hook init params are focused on authentication and connection behavior
   
   i think it might make sense to move most of these kwargs to the `get_or_create_crawler` method
   
   sort of like how with s3hook you don't provide key and bucket at init, or with sql hook you don't provide sql at init -- but you provide methods for interacting with the service.
   
   then you could instantiate a hook once and use it to deal with multiple crawlers.  
   
   then, re your slack question about too many params,  perhaps the method becomes `get_or_create_crawler(**crawler_kwargs)`
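   
   to make that concrete, a hypothetical usage sketch under this design (the method names are assumed, not final):
   
   ```python
   # auth/connection config at init; crawler config per method call
   hook = AwsGlueCrawlerHook(aws_conn_id='aws_default')
   for crawler_config in [crawler_config_a, crawler_config_b]:  # placeholder dicts of Glue crawler kwargs
       crawler_name = hook.get_or_create_crawler(**crawler_config)
       hook.start_crawler(crawler_name)
   ```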
   

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,294 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param crawler_name: Unique crawler name per AWS Account
+    :type crawler_name: Optional[str]
+    :param crawler_desc: Crawler description
+    :type crawler_desc: Optional[str]
+    :param glue_db_name: AWS glue catalog database ID
+    :type glue_db_name: Optional[str]
+    :param iam_role_name: AWS IAM role for glue crawler
+    :type iam_role_name: Optional[str]
+    :param region_name: AWS region name (e.g. 'us-west-2')
+    :type region_name: Optional[str]
+    :param s3_target_configuration: Configurations for crawling AWS S3 paths
+    :type s3_target_configuration: Optional[list]
+    :param jdbc_target_configuration: Configurations for crawling JDBC paths
+    :type jdbc_target_configuration: Optional[list]
+    :param mongo_target_configuration: Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration: Optional[list]
+    :param dynamo_target_configuration: Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration: Optional[list]
+    :param glue_catalog_target_configuration: Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration: Optional[list]
+    :param cron_schedule: Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule: Optional[str]
+    :param classifiers: List of user defined custom classifiers to be used by the crawler
+    :type classifiers: Optional[list]
+    :param table_prefix: Prefix for catalog table to be created
+    :type table_prefix: Optional[str]
+    :param update_behavior: Behavior when the crawler identifies schema changes
+    :type update_behavior: Optional[str]
+    :param delete_behavior: Behavior when the crawler identifies deleted objects
+    :type delete_behavior: Optional[str]
+    :param recrawl_behavior: Behavior when the crawler needs to crawl again
+    :type recrawl_behavior: Optional[str]
+    :param lineage_settings: Enables or disables data lineage
+    :type lineage_settings: Optional[str]
+    :param json_configuration: Versioned JSON configuration for the crawler
+    :type json_configuration: Optional[str]
+    :param security_configuration: Name of the security configuration structure to be used by the crawler.
+    :type security_configuration: Optional[str]
+    :param tags: Tags to attach to the crawler request
+    :type tags: Optional[dict]
+    :param overwrite: Determines if the crawler should be updated if the crawler configuration changes
+    :type overwrite: Optional[bool]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        overwrite=False,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        self.overwrite = overwrite
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        """:return: Lists of Crawlers"""
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue to run crawler
+        :return:
+        """
+        glue_client = self.get_conn()

Review comment:
       one thing i like to do in hooks is create a cached property such as `client` or `glue_client` or in other hooks `session`, so that you don't need to remember to call `get_conn` in every case, and you can just reuse the same authenticated object without reauthenticating.
   
   e.g. 
   
   ```python
   @cached_property
   def glue_client(self):
       return self.get_conn()
   ```

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,294 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param crawler_name: Unique crawler name per AWS Account
+    :type crawler_name: Optional[str]
+    :param crawler_desc: Crawler description
+    :type crawler_desc: Optional[str]
+    :param glue_db_name: AWS glue catalog database ID
+    :type glue_db_name: Optional[str]
+    :param iam_role_name: AWS IAM role for glue crawler
+    :type iam_role_name: Optional[str]
+    :param region_name: AWS region name (e.g. 'us-west-2')
+    :type region_name: Optional[str]
+    :param s3_target_configuration: Configurations for crawling AWS S3 paths
+    :type s3_target_configuration: Optional[list]
+    :param jdbc_target_configuration: Configurations for crawling JDBC paths
+    :type jdbc_target_configuration: Optional[list]
+    :param mongo_target_configuration: Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration: Optional[list]
+    :param dynamo_target_configuration: Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration: Optional[list]
+    :param glue_catalog_target_configuration: Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration: Optional[list]
+    :param cron_schedule: Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule: Optional[str]
+    :param classifiers: List of user defined custom classifiers to be used by the crawler
+    :type classifiers: Optional[list]
+    :param table_prefix: Prefix for catalog table to be created
+    :type table_prefix: Optional[str]
+    :param update_behavior: Behavior when the crawler identifies schema changes
+    :type update_behavior: Optional[str]
+    :param delete_behavior: Behavior when the crawler identifies deleted objects
+    :type delete_behavior: Optional[str]
+    :param recrawl_behavior: Behavior when the crawler needs to crawl again
+    :type recrawl_behavior: Optional[str]
+    :param lineage_settings: Enables or disables data lineage
+    :type lineage_settings: Optional[str]
+    :param json_configuration: Versioned JSON configuration for the crawler
+    :type json_configuration: Optional[str]
+    :param security_configuration: Name of the security configuration structure to be used by the crawler.
+    :type security_configuration: Optional[str]
+    :param tags: Tags to attach to the crawler request
+    :type tags: Optional[dict]
+    :param overwrite: Determines if the crawler should be updated if the crawler configuration changes
+    :type overwrite: Optional[bool]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        overwrite=False,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        self.overwrite = overwrite
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        """:return: Lists of Crawlers"""
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):

Review comment:
       i might consider chopping this method because initialize is a bit confusing and it doesn't really do much.
   
   then in your operator execute you would just call
   ```python
   crawler_name = hook.get_or_create_glue_crawler(**crawler_kwargs)
   hook.start_crawler(crawler_name)
   hook.await_crawler(crawler_name)
   ```
   which i think is a tad clearer
   

##########
File path: airflow/providers/amazon/aws/operators/glue_crawler.py
##########
@@ -0,0 +1,164 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from airflow.models import BaseOperator
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerOperator(BaseOperator):
+    """
+    Creates an AWS Glue Crawler. AWS Glue Crawler is a serverless
+    service for inferring the schema, format and data types of data stored on the AWS cloud.
+    Language support: Python
+
+    :param crawler_name: Unique crawler name per AWS Account
+    :type crawler_name: Optional[str]
+    :param crawler_desc: Crawler description
+    :type crawler_desc: Optional[str]
+    :param glue_db_name: AWS glue catalog database ID
+    :type glue_db_name: Optional[str]
+    :param iam_role_name: AWS IAM role for glue crawler
+    :type iam_role_name: Optional[str]
+    :param region_name: AWS region name (e.g. 'us-west-2')
+    :type region_name: Optional[str]
+    :param s3_targets_configuration: Configurations for crawling AWS S3 paths
+    :type s3_targets_configuration: Optional[list]
+    :param jdbc_targets_configuration: Configurations for crawling JDBC paths
+    :type jdbc_targets_configuration: Optional[list]
+    :param mongo_targets_configuration: Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_targets_configuration: Optional[list]
+    :param dynamo_targets_configuration: Configurations for crawling AWS DynamoDB
+    :type dynamo_targets_configuration: Optional[list]
+    :param glue_catalog_targets_configuration: Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_targets_configuration: Optional[list]
+    :param cron_schedule: Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule: Optional[str]
+    :param classifiers: List of user defined custom classifiers to be used by the crawler
+    :type classifiers: Optional[list]
+    :param table_prefix: Prefix for catalog table to be created
+    :type table_prefix: Optional[str]
+    :param update_behavior: Behavior when the crawler identifies schema changes
+    :type update_behavior: Optional[str]
+    :param delete_behavior: Behavior when the crawler identifies deleted objects
+    :type delete_behavior: Optional[str]
+    :param recrawl_behavior: Behavior when the crawler needs to crawl again
+    :type recrawl_behavior: Optional[str]
+    :param lineage_settings: Enables or disables data lineage
+    :type lineage_settings: Optional[str]
+    :param json_configuration: Versioned JSON configuration for the crawler
+    :type json_configuration: Optional[str]
+    :param security_configuration: Name of the security configuration structure to be used by the crawler.
+    :type security_configuration: Optional[str]
+    :param tags: Tags to attach to the crawler request
+    :type tags: Optional[dict]
+    :param overwrite: Determines if the crawler should be updated if the crawler configuration changes
+    :type overwrite: Optional[bool]
+    """
+
+    template_fields = ()
+    template_ext = ()
+    ui_color = '#ededed'
+
+    @apply_defaults
+    def __init__(
+        self,
+        *,
+        crawler_name='aws_glue_default_crawler',
+        crawler_desc='AWS Glue Crawler with Airflow',
+        aws_conn_id='aws_default',
+        glue_db_name=None,
+        iam_role_name=None,
+        region_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,

Review comment:
       in an operator, it does make sense to have the config as init params.
   but here i'd suggest consolidating them into the more future-proof and simpler `crawler_kwargs`, a dict param
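   
   for illustration, a hypothetical instantiation under that suggestion (the `config` param name and the values are assumed, not from the PR; the dict keys are standard Glue create_crawler fields):
   
   ```python
   crawl_task = AwsGlueCrawlerOperator(
       task_id='crawl_s3_data',
       config={
           'Name': 'example-crawler',
           'Role': 'example-glue-role',
           'DatabaseName': 'example_db',
           'Targets': {'S3Targets': [{'Path': 's3://example-bucket/raw/'}]},
       },
   )
   ```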

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,294 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param crawler_name: Unique crawler name per AWS Account
+    :type crawler_name: Optional[str]
+    :param crawler_desc: Crawler description
+    :type crawler_desc: Optional[str]
+    :param glue_db_name: AWS glue catalog database ID
+    :type glue_db_name: Optional[str]
+    :param iam_role_name: AWS IAM role for glue crawler
+    :type iam_role_name: Optional[str]
+    :param region_name: AWS region name (e.g. 'us-west-2')
+    :type region_name: Optional[str]
+    :param s3_target_configuration: Configurations for crawling AWS S3 paths
+    :type s3_target_configuration: Optional[list]
+    :param jdbc_target_configuration: Configurations for crawling JDBC paths
+    :type jdbc_target_configuration: Optional[list]
+    :param mongo_target_configuration: Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration: Optional[list]
+    :param dynamo_target_configuration: Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration: Optional[list]
+    :param glue_catalog_target_configuration: Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration: Optional[list]
+    :param cron_schedule: Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule: Optional[str]
+    :param classifiers: List of user defined custom classifiers to be used by the crawler
+    :type classifiers: Optional[list]
+    :param table_prefix: Prefix for catalog table to be created
+    :type table_prefix: Optional[str]
+    :param update_behavior: Behavior when the crawler identifies schema changes
+    :type update_behavior: Optional[str]
+    :param delete_behavior: Behavior when the crawler identifies deleted objects
+    :type delete_behavior: Optional[str]
+    :param recrawl_behavior: Behavior when the crawler needs to crawl again
+    :type recrawl_behavior: Optional[str]
+    :param lineage_settings: Enables or disables data lineage
+    :type lineage_settings: Optional[str]
+    :param json_configuration: Versioned JSON configuration for the crawler
+    :type json_configuration: Optional[str]
+    :param security_configuration: Name of the security configuration structure to be used by the crawler.
+    :type security_configuration: Optional[str]
+    :param tags: Tags to attach to the crawler request
+    :type tags: Optional[dict]
+    :param overwrite: Determines if the crawler should be updated if the crawler configuration changes
+    :type overwrite: Optional[bool]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        overwrite=False,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        self.overwrite = overwrite
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        """:return: Lists of Crawlers"""
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue to run crawler
+        :return:
+        """
+        glue_client = self.get_conn()
+
+        try:
+            crawler_name = self.get_or_create_glue_crawler()
+            crawler_run = glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:
+            self.log.error("Failed to run aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_state = crawler_run['Crawler']['State']
+        return crawler_run_state
+
+    def get_crawler_status(self, crawler_name: str) -> str:

Review comment:
       `get_crawl_status` -- it's the status of the crawl, i.e. the run, not the state of the crawler
   
   this would help clarify the difference between `get_crawler_status` and `get_crawler_state`, which sound like the same thing
   
   you could even do `get_last_crawl_status` to be explicit
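   
   a sketch of that rename, assuming the hook's `glue_client` property (illustrative only):
   
   ```python
   def get_last_crawl_status(self, crawler_name: str) -> str:
       """Status of the most recent crawl: SUCCEEDED, CANCELLED, or FAILED."""
       crawler = self.glue_client.get_crawler(Name=crawler_name)
       return crawler['Crawler']['LastCrawl']['Status']
   ```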

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,294 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param crawler_name: Unique crawler name per AWS Account
+    :type crawler_name: Optional[str]
+    :param crawler_desc: Crawler description
+    :type crawler_desc: Optional[str]
+    :param glue_db_name: AWS glue catalog database ID
+    :type glue_db_name: Optional[str]
+    :param iam_role_name: AWS IAM role for glue crawler
+    :type iam_role_name: Optional[str]
+    :param region_name: AWS region name (e.g. 'us-west-2')
+    :type region_name: Optional[str]
+    :param s3_target_configuration: Configurations for crawling AWS S3 paths
+    :type s3_target_configuration: Optional[list]
+    :param jdbc_target_configuration: Configurations for crawling JDBC paths
+    :type jdbc_target_configuration: Optional[list]
+    :param mongo_target_configuration: Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration: Optional[list]
+    :param dynamo_target_configuration: Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration: Optional[list]
+    :param glue_catalog_target_configuration: Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration: Optional[list]
+    :param cron_schedule: Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule: Optional[str]
+    :param classifiers: List of user defined custom classifiers to be used by the crawler
+    :type classifiers: Optional[list]
+    :param table_prefix: Prefix for catalog table to be created
+    :type table_prefix: Optional[str]
+    :param update_behavior: Behavior when the crawler identifies schema changes
+    :type update_behavior: Optional[str]
+    :param delete_behavior: Behavior when the crawler identifies deleted objects
+    :type delete_behavior: Optional[str]
+    :param recrawl_behavior: Behavior when the crawler needs to crawl again
+    :type recrawl_behavior: Optional[str]
+    :param lineage_settings: Enables or disables data lineage
+    :type lineage_settings: Optional[str]
+    :param json_configuration: Versioned JSON configuration for the crawler
+    :type json_configuration: Optional[str]
+    :param security_configuration: Name of the security configuration structure to be used by the crawler.
+    :type security_configuration: Optional[str]
+    :param tags: Tags to attach to the crawler request
+    :type tags: Optional[dict]
+    :param overwrite: Determines if the crawler should be updated if the crawler configuration changes
+    :type overwrite: Optional[bool]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        overwrite=False,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        self.overwrite = overwrite
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        """:return: Lists of Crawlers"""
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue to run crawler
+        :return:
+        """
+        glue_client = self.get_conn()
+
+        try:
+            crawler_name = self.get_or_create_glue_crawler()
+            crawler_run = glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:
+            self.log.error("Failed to run aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_state = crawler_run['Crawler']['State']
+        return crawler_run_state
+
+    def get_crawler_status(self, crawler_name: str) -> str:
+        """
+        Get current status of the Glue crawler. The crawler
+        status can be succeeded, cancelled, or failed.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Status of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_status = crawler_run['Crawler']['LastCrawl']['Status']
+        return crawler_run_status
+
+    def crawler_completion(self, crawler_name: str) -> str:

Review comment:
       maybe `wait_for_crawler_completion` or `await_crawler_completion` or `await_completion` or `wait_for_completion` -- something to indicate what it will do
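   
   a possible shape for the renamed method (hypothetical; `get_last_crawl_status` and the poll interval default are assumed):
   
   ```python
   from time import sleep
   
   def wait_for_crawler_completion(self, crawler_name: str, poll_interval: int = 5) -> str:
       """Blocks until the crawler returns to READY, then reports the last crawl's status."""
       while self.get_crawler_state(crawler_name) != 'READY':
           sleep(poll_interval)
       return self.get_last_crawl_status(crawler_name)
   ```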

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,294 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param crawler_name: Unique crawler name per AWS Account
+    :type crawler_name: Optional[str]
+    :param crawler_desc: Crawler description
+    :type crawler_desc: Optional[str]
+    :param glue_db_name: AWS glue catalog database ID
+    :type glue_db_name: Optional[str]
+    :param iam_role_name: AWS IAM role for glue crawler
+    :type iam_role_name: Optional[str]
+    :param region_name: AWS region name (e.g. 'us-west-2')
+    :type region_name: Optional[str]
+    :param s3_target_configuration: Configurations for crawling AWS S3 paths
+    :type s3_target_configuration: Optional[list]
+    :param jdbc_target_configuration: Configurations for crawling JDBC paths
+    :type jdbc_target_configuration: Optional[list]
+    :param mongo_target_configuration: Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration: Optional[list]
+    :param dynamo_target_configuration: Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration: Optional[list]
+    :param glue_catalog_target_configuration: Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration: Optional[list]
+    :param cron_schedule: Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule: Optional[str]
+    :param classifiers: List of user defined custom classifiers to be used by the crawler
+    :type classifiers: Optional[list]
+    :param table_prefix: Prefix for catalog table to be created
+    :type table_prefix: Optional[str]
+    :param update_behavior: Behavior when the crawler identifies schema changes
+    :type update_behavior: Optional[str]
+    :param delete_behavior: Behavior when the crawler identifies deleted objects
+    :type delete_behavior: Optional[str]
+    :param recrawl_behavior: Behavior when the crawler needs to crawl again
+    :type recrawl_behavior: Optional[str]
+    :param lineage_settings: Enables or disables data lineage
+    :type lineage_settings: Optional[str]
+    :param json_configuration: Versioned JSON configuration for the crawler
+    :type json_configuration: Optional[str]
+    :param security_configuration: Name of the security configuration structure to be used by the crawler.
+    :type security_configuration: Optional[str]
+    :param tags: Tags to attach to the crawler request
+    :type tags: Optional[dict]
+    :param overwrite: Determines if the crawler should be updated if the crawler configuration changes
+    :type overwrite: Optional[bool]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        overwrite=False,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        self.overwrite = overwrite
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        """:return: Lists of Crawlers"""
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue to run crawler
+        :return:
+        """
+        glue_client = self.get_conn()
+
+        try:
+            crawler_name = self.get_or_create_glue_crawler()
+            crawler_run = glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:
+            self.log.error("Failed to run aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_state = crawler_run['Crawler']['State']
+        return crawler_run_state
+
+    def get_crawler_status(self, crawler_name: str) -> str:
+        """
+        Get current status of the Glue crawler. The crawler
+        status can be succeeded, cancelled, or failed.
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Status of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_status = crawler_run['Crawler']['LastCrawl']['Status']
+        return crawler_run_status
+
+    def crawler_completion(self, crawler_name: str) -> str:
+        """
+        Waits until Glue crawler with crawler_name completes or
+        fails and returns final state if finished.
+        Raises AirflowException when the crawler failed
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Dict of crawler's status
+        """
+        failed_status = ['FAILED', 'CANCELLED']
+
+        while True:
+            crawler_run_state = self.get_crawler_state(crawler_name)
+            if crawler_run_state == 'READY':
+                self.log.info("Crawler: %s State: %s", crawler_name, crawler_run_state)
+                crawler_run_status = self.get_crawler_status(crawler_name)
+                if crawler_run_status in failed_status:
+                    crawler_error_message = (
+                        "Exiting Crawler: " + crawler_name + " Run State: " + crawler_run_state
+                    )
+                    self.log.info(crawler_error_message)
+                    raise AirflowException(crawler_error_message)
+                else:
+                    self.log.info("Crawler Status: %s", crawler_run_status)
+                    metrics = self.get_crawler_metrics(self.crawler_name)
+                    print('Last Runtime Duration (seconds): ', metrics['LastRuntimeSeconds'])
+                    print('Median Runtime Duration (seconds): ', metrics['MedianRuntimeSeconds'])
+                    print('Tables Created: ', metrics['TablesCreated'])
+                    print('Tables Updated: ', metrics['TablesUpdated'])
+                    print('Tables Deleted: ', metrics['TablesDeleted'])
+
+                    return {'Status': crawler_run_status}

Review comment:
       why not just return `crawler_run_status` instead of wrapping it in a dict?
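
   A minimal sketch of that simplification (hypothetical, not code from the PR), assuming the helpers from the quoted hunk plus `from time import sleep` and Airflow's `AirflowException`, and swapping the `print` calls for the hook's logger:

   ```python
   def crawler_completion(self, crawler_name: str) -> str:
       """Wait for the crawler to finish and return its final status string."""
       while True:
           if self.get_crawler_state(crawler_name) == 'READY':
               crawler_run_status = self.get_crawler_status(crawler_name)
               if crawler_run_status in ('FAILED', 'CANCELLED'):
                   raise AirflowException(
                       f"Exiting Crawler: {crawler_name} Status: {crawler_run_status}"
                   )
               self.log.info("Crawler Status: %s", crawler_run_status)
               # plain string instead of {'Status': crawler_run_status}
               return crawler_run_status
           sleep(self.CRAWLER_POLL_INTERVAL)
   ```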




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r548388906



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,181 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> str:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=role_name)
+            self.log.info("Iam Role Name: %s", role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_or_create_crawler(self, config) -> str:
+        """
+        Creates the crawler if the crawler doesn't exist and returns the crawler name
+
+        :param config = Configurations for the AWS Glue crawler
+        :type config = Optional[dict]
+        :return: Name of the crawler
+        """
+        self.get_iam_execution_role(config["Role"])

Review comment:
       do you need to do anything with this role you are retrieving?  it seems this doesn't have an effect?
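
   For context, a sketch of how that call can serve as a pure existence check (hypothetical standalone snippet; assumes the standard boto3 IAM client, whose `get_role` raises `NoSuchEntityException` for a missing role):

   ```python
   import boto3

   def role_exists(role_name: str) -> bool:
       """Existence check only: the returned role metadata is intentionally unused."""
       iam_client = boto3.client('iam')
       try:
           iam_client.get_role(RoleName=role_name)
           return True
       except iam_client.exceptions.NoSuchEntityException:
           return False
   ```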




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560145662



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name = IAM role name

Review comment:
       ```suggestion
           :param role_name: IAM role name
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560489585



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       i remember this discussion
   
   i would leave it with the original error, which already mentions the role issue when that's what it is
   
   i think adding this only adds confusion.
   
   and the exception won't _always_ be due to the role
   
   my 2 cents 🤦‍♂️
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-745525128


   [The Workflow run](https://github.com/apache/airflow/actions/runs/424038097) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-766606970


   Awesome work, congrats on your first merged pull request!
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560489585



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       i remember this discussion
   
   i would leave it with the original error, which already mentions a problem related to the role when that's the cause of the exception
   
   i think adding this bit of code only adds confusion.  it's kind of a backward, side-effecty way of accomplishing what you're trying to accomplish.  and the exception won't _always_ be due to the role.
   
   if you really must translate the error, i would parse the error message for the specific wording that appears in this case, and if you find it, raise an informative message; in other circumstances just raise.  that would be more direct, clear, and explicit.
   
   
   my 2 cents  🤷‍♂️
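
   A sketch of the explicit translation described above (hypothetical; the matched error wording is an assumption, not taken from the boto3 docs):

   ```python
   from airflow.exceptions import AirflowException

   def create_crawler(glue_client, **crawler_kwargs) -> str:
       """Create the crawler, translating only the known role-related error."""
       try:
           glue_client.create_crawler(**crawler_kwargs)
           # boto3's create_crawler response carries no crawler payload,
           # so return the name the caller supplied.
           return crawler_kwargs['Name']
       except glue_client.exceptions.InvalidInputException as err:
           # Assumed wording for the missing/untrusted-role case; anything
           # else is re-raised untouched.
           if 'unable to assume' in str(err).lower():
               raise AirflowException(
                   f"Role {crawler_kwargs.get('Role')!r} may not exist or may "
                   "lack a trust policy for the Glue service"
               ) from err
           raise
   ```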
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r561136681



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       After thinking about it, removing the `check_iam_role` method is the right move. As you said @dstandish, the InvalidInputException already points the error at the role, so that exception should be enough. A user debugging the trust policy error would naturally check whether the role they passed in is the right one :). I'm glad you brought it up; it was a good learning point for me 👍.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] feluelle commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
feluelle commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r559368275



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        non-existing role as a role trust policy error.
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if the crawler doesn't exist and returns the crawler name
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler %s", crawler_name)
+            try:
+                glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+                return glue_response['Crawler']['Name']
+            except self.glue_client.exceptions.InvalidInputException as general_error:
+                self.check_iam_role(crawler_kwargs['Role'])
+                raise AirflowException(general_error)

Review comment:
       Hmm actually do we need to call `get_crawler` again if we know the crawler exists? Can we not just take `crawler_name` (provided), or is it different from the one returned by `get_crawler`?
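
   A sketch of that refactor (hypothetical), returning the caller-provided name and dropping the second read of the `get_crawler` response:

   ```python
   def get_or_create_crawler(self, **crawler_kwargs) -> str:
       """Create or update the crawler and return the caller-provided name."""
       crawler_name = crawler_kwargs['Name']
       try:
           self.glue_client.get_crawler(Name=crawler_name)
           self.log.info("Crawler %s already exists; updating crawler", crawler_name)
           self.glue_client.update_crawler(**crawler_kwargs)
       except self.glue_client.exceptions.EntityNotFoundException:
           self.log.info("Creating AWS Glue crawler %s", crawler_name)
           self.glue_client.create_crawler(**crawler_kwargs)
       return crawler_name
   ```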




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560489585



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       i remember this discussion
   
   i would leave it with the original error, which already mentions a problem related to the role when that's the cause of the exception
   
   i think adding this only adds confusion.
   
   and the exception won't _always_ be due to the role
   
   my 2 cents  🤷‍♂️
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r543792678



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,288 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List, Optional
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param crawler_name: Unique crawler name per AWS Account
+    :type crawler_name: Optional[str]
+    :param crawler_desc = Crawler description
+    :type crawler_desc = Optional[str]
+    :param glue_db_name = AWS glue catalog database ID
+    :type glue_db_name = Optional[str]
+    :param iam_role_name = AWS IAM role for glue crawler
+    :type iam_role_name = Optional[str]
+    :param region_name = AWS region name (e.g. 'us-west-2')
+    :type region_name = Optional[str]
+    :param s3_target_configuration = Configurations for crawling AWS S3 paths
+    :type s3_target_configuration = Optional[list]
+    :param jdbc_target_configuration = Configurations for crawling JDBC paths
+    :type jdbc_target_configuration = Optional[list]
+    :param mongo_target_configuration = Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration = Optional[list]
+    :param dynamo_target_configuration = Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration = Optional[list]
+    :param glue_catalog_target_configuration = Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration = Optional[list]
+    :param cron_schedule = Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule = Optional[str]
+    :param classifiers = List of user defined custom classifiers to be used by the crawler
+    :type classifiers = Optional[list]
+    :param table_prefix = Prefix for catalog table to be created
+    :type table_prefix = Optional[str]
+    :param update_behavior = Behavior when the crawler identifies schema changes
+    :type update_behavior = Optional[str]
+    :param delete_behavior = Behavior when the crawler identifies deleted objects
+    :type delete_behavior = Optional[str]
+    :param recrawl_behavior = Behavior when the crawler needs to crawl again
+    :type recrawl_behavior = Optional[str]
+    :param lineage_settings = Enables or disables data lineage
+    :type lineage_settings = Optional[str]
+    :param json_configuration = Versioned JSON configuration for the crawler
+    :type json_configuration = Optional[str]
+    :param security_configuration = Name of the security configuration structure to be used by the crawler.
+    :type security_configuration = Optional[str]
+    :param tags = Tags to attach to the crawler request
+    :type tags = Optional[dict]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        ":return: Lists of Crawlers"
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        ":return: iam role for crawler execution"
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue to run crawler
+        :return:
+        """
+        glue_client = self.get_conn()
+
+        try:
+            crawler_name = self.get_or_create_glue_crawler()
+            crawler_run = glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:
+            self.log.error("Failed to run aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_state = crawler_run['Crawler']['State']
+        return crawler_run_state
+
+    def get_crawler_status(self, crawler_name: str) -> str:
+        """
+        Get current status of the Glue crawler. The crawler
+        status can be succeeded, cancelled, or failed.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Status of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_status = crawler_run['Crawler']['LastCrawl']['Status']
+        return crawler_run_status
+
+    def crawler_completion(self, crawler_name: str) -> str:
+        """
+        Waits until Glue crawler with crawler_name completes or
+        fails and returns final state if finished.
+        Raises AirflowException when the crawler failed
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Dict of crawler's status
+        """
+        failed_status = ['FAILED', 'CANCELLED']
+
+        while True:
+            crawler_run_state = self.get_crawler_state(crawler_name)
+            if crawler_run_state == 'READY':
+                self.log.info("Crawler: %s State: %s", crawler_name, crawler_run_state)
+                crawler_run_status = self.get_crawler_status(crawler_name)
+                if crawler_run_status in failed_status:
+                    crawler_error_message = (
+                        "Exiting Crawler: " + crawler_name + " Run State: " + crawler_run_state
+                    )
+                    self.log.info(crawler_error_message)
+                    raise AirflowException(crawler_error_message)
+                else:
+                    self.log.info("Crawler Status: %s", crawler_run_status)
+                    metrics = self.get_crawler_metrics(self.crawler_name)
+                    print('Last Runtime Duration (seconds): ', metrics['LastRuntimeSeconds'])
+                    print('Median Runtime Duration (seconds): ', metrics['MedianRuntimeSeconds'])
+                    print('Tables Created: ', metrics['TablesCreated'])
+                    print('Tables Updated: ', metrics['TablesUpdated'])
+                    print('Tables Deleted: ', metrics['TablesDeleted'])
+                    
+                    return {'Status': crawler_run_status}
+                    

Review comment:
       ```suggestion
   
                       return {'Status': crawler_run_status}
   
   ```
   Whitespace




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-745617961


   [The Workflow run](https://github.com/apache/airflow/actions/runs/424360315) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r543722769



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,282 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List, Optional
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param crawler_name = Unique crawler name per AWS Account

Review comment:
       ```suggestion
   
       :param crawler_name: Unique crawler name per AWS Account
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r561138453



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       @feluelle Thanks for the input, it's always nice to minimize the number of requests :) 
   
   I'll push the changes sometime today




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] feluelle merged pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
feluelle merged pull request #13072:
URL: https://github.com/apache/airflow/pull/13072


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r559776829



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        non-existing role as a role trust policy error.
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if the crawler doesn't exist and returns the crawler name
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler %s", crawler_name)
+            try:
+                glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+                return glue_response['Crawler']['Name']
+            except self.glue_client.exceptions.InvalidInputException as general_error:
+                self.check_iam_role(crawler_kwargs['Role'])
+                raise AirflowException(general_error)

Review comment:
       Good point, no we don't. They are both the same. I'm thinking of deleting the `get_crawler()` function from the hook entirely, as it's not needed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560497057



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> bool:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and False otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       my intention is not to block this PR if no one else cares about this, but I just want to highlight the issue in case anyone does
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r551508779



##########
File path: airflow/providers/amazon/aws/operators/glue_crawler.py
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from cached_property import cached_property
+
+from airflow.models import BaseOperator
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerOperator(BaseOperator):
+    """
+    Creates, updates and triggers an AWS Glue Crawler. AWS Glue Crawler is a serverless
+    service that manages a catalog of metadata tables that contain the inferred
+    schema, format and data types of data stores within the AWS cloud.
+
+    :param config: Configurations for the AWS Glue crawler
+    :type config: dict
+    :param aws_conn_id: aws connection to use
+    :type aws_conn_id: Optional[str]
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: Optional[int]
+    """
+
+    ui_color = '#ededed'
+
+    @apply_defaults
+    def __init__(
+        self,
+        config,
+        aws_conn_id='aws_default',
+        poll_interval: int = 5,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.aws_conn_id = aws_conn_id
+        self.poll_interval = poll_interval
+        self.config = config
+
+    @cached_property
+    def hook(self) -> AwsGlueCrawlerHook:
+        """Create and return an AwsGlueCrawlerHook."""
+        return AwsGlueCrawlerHook(self.aws_conn_id)
+
+    def execute(self, context):
+        """
+        Executes AWS Glue Crawler from Airflow
+        :return: the name of the current glue crawler.
+        """
+        crawler_name = self.hook.get_or_create_crawler(**self.config)
+        self.log.info("Triggering AWS Glue Crawler")
+        self.hook.start_crawler(crawler_name)
+        self.log.info("Waiting for AWS Glue Crawler")
+        self.hook.wait_for_crawler_completion(crawler_name)

Review comment:
       Thanks! Currently working on it; should have a revision up in ~30 mins.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560145761



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name = IAM role name
+        :type role_name = str

Review comment:
       ```suggestion
           :type role_name: str
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] feluelle commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
feluelle commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560738965



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       I would suggest adding @marshall7m's comment there, sth. like:
   
   ```python
   try:
       ...
   except self.glue_client.exceptions.InvalidInputException:
       # <add-explanation-here>
       self.check_iam_role(crawler_kwargs['Role'])
       raise
   ```
    I would not raise `AirflowException` tbh. Personally I don't like using this type of exception if it is actually unrelated to Airflow. Just re-raise the error in case it has nothing to do with the IAM role. If it is because of the IAM role, `check_iam_role` will raise a more accurate error.
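    
    Filled in with the boto3 explanation from the hook's docstring, the except block could read sth. like this (sketch):
    
    ```python
    try:
        glue_response = self.glue_client.create_crawler(**crawler_kwargs)
        return glue_response['Crawler']['Name']
    except self.glue_client.exceptions.InvalidInputException:
        # boto3 (<=1.16.46) misleadingly reports a non-existing role as a
        # role trust policy error; check_iam_role() raises a clearer
        # NoSuchEntityException if the role is missing, otherwise the
        # original error is re-raised unchanged.
        self.check_iam_role(crawler_kwargs['Role'])
        raise
    ```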
   
   WDYT?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mschmo commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mschmo commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r551507672



##########
File path: airflow/providers/amazon/aws/operators/glue_crawler.py
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from cached_property import cached_property
+
+from airflow.models import BaseOperator
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerOperator(BaseOperator):
+    """
+    Creates, updates and triggers an AWS Glue Crawler. AWS Glue Crawler is a serverless
+    service that manages a catalog of metadata tables that contain the inferred
+    schema, format and data types of data stores within the AWS cloud.
+
+    :param config: Configurations for the AWS Glue crawler
+    :type config: dict
+    :param aws_conn_id: aws connection to use
+    :type aws_conn_id: Optional[str]
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: Optional[int]
+    """
+
+    ui_color = '#ededed'
+
+    @apply_defaults
+    def __init__(
+        self,
+        config,
+        aws_conn_id='aws_default',
+        poll_interval: int = 5,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.aws_conn_id = aws_conn_id
+        self.poll_interval = poll_interval
+        self.config = config
+
+    @cached_property
+    def hook(self) -> AwsGlueCrawlerHook:
+        """Create and return an AwsGlueCrawlerHook."""
+        return AwsGlueCrawlerHook(self.aws_conn_id)
+
+    def execute(self, context):
+        """
+        Executes AWS Glue Crawler from Airflow
+        :return: the name of the current glue crawler.
+        """
+        crawler_name = self.hook.get_or_create_crawler(**self.config)
+        self.log.info("Triggering AWS Glue Crawler")
+        self.hook.start_crawler(crawler_name)
+        self.log.info("Waiting for AWS Glue Crawler")
+        self.hook.wait_for_crawler_completion(crawler_name)

Review comment:
       I think we should pass `self.poll_interval` into this method call; otherwise that instance attribute is useless.
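    
    e.g. sth. like this (sketch; assumes `wait_for_crawler_completion` accepts a `poll_interval` argument):
    
    ```python
    self.hook.wait_for_crawler_completion(crawler_name, poll_interval=self.poll_interval)
    ```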




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mschmo commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mschmo commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r550312524



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,159 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval

Review comment:
       IMO `poll_interval` shouldn't be a required argument for the hook. It makes sense as an argument for the sensor, but I think the hook should be a bit more agnostic about how it interacts with the Glue API. For instance, `poll_interval` is irrelevant if I want to initialize a hook solely to get or create a crawler, and not poll for its status.
   
   I believe it may be better to pass `poll_interval` in as an argument to the `wait_for_crawler_completion` method, which is the only place it's used atm.
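    
    i.e., roughly this shape (a sketch; the default value here is just an assumption):
    
    ```python
    def wait_for_crawler_completion(self, crawler_name: str, poll_interval: int = 5) -> str:
        """Wait until the crawler finishes, checking its state every poll_interval seconds."""
        ...
    ```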




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560489585



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> bool:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and False otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       i remember this discussion
   
    i would leave it with the original error, which already mentions the role issue when that's what it is
   
   i think adding this only adds confusion.
   
    and the exception won't _always_ be due to the role
   
   my 2 cents  🤷‍♂️
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r549739066



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,181 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> str:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=role_name)
+            self.log.info("Iam Role Name: %s", role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_or_create_crawler(self, config) -> str:
+        """
+        Creates the crawler if the crawler doesn't exist and returns the crawler name
+
+        :param config = Configurations for the AWS Glue crawler
+        :type config = Optional[dict]
+        :return: Name of the crawler
+        """
+        self.get_iam_execution_role(config["Role"])

Review comment:
       @dstandish I tested the Glue API's create_crawler() method with an invalid role, and the exception was misleading because it produced: ```botocore.errorfactory.InvalidInputException: An error occurred (InvalidInputException) when calling the CreateCrawler operation: Service is unable to assume role arn:aws:iam::111111111111:role/test-foo-role. Please verify role's TrustPolicy```. By contrast, `self.get_iam_execution_role()` raised a more appropriate exception: ```botocore.errorfactory.NoSuchEntityException: An error occurred (NoSuchEntity) when calling the GetRole operation: The role with name test-foo-role cannot be found.```
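    
    In other words, checking the role up front turns the misleading trust-policy error into an accurate missing-role error. A standalone sketch of just the check (role name reused from the example errors above):
    
    ```python
    import boto3
    
    iam_client = boto3.client('iam')
    
    # Raises NoSuchEntityException with a clear message if the role is
    # missing, instead of letting create_crawler() fail later with the
    # misleading "verify role's TrustPolicy" error.
    iam_client.get_role(RoleName='test-foo-role')
    ```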




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560496932



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> bool:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and False otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       Any thoughts on this? @mik-laj @mschmo @feluelle 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560295521



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> bool:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs = Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs = any
+        :return: True if crawler was updated and False otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])
+            raise AirflowException(general_error)
+
+    def start_crawler(self, crawler_name: str) -> dict:
+        """
+        Triggers the AWS Glue crawler
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Empty dictionary
+        """
+        crawler = self.glue_client.start_crawler(Name=crawler_name)
+        return crawler
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        crawler = self.glue_client.get_crawler(Name=crawler_name)
+        crawler_state = crawler['Crawler']['State']

Review comment:
       maybe remove `get_crawler_status()` and `get_crawler_state()` and add a `get_crawler()` function that returns the result of `glue_client.get_crawler()`? Then we can just index into the response to get the state/status within `wait_for_crawler_completion()`?
   
   something like this for `wait_for_crawler_completion()`:
   
    ```python
    while True:
        crawler = self.get_crawler(crawler_name)
        crawler_state = crawler['Crawler']['State']
        if crawler_state == 'READY':
            self.log.info("State: %s", crawler_state)
            crawler_status = crawler['Crawler']['LastCrawl']['Status']
    ```
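    
    A rough sketch of how the full `wait_for_crawler_completion()` could then look (simplified; no timeout or failed/cancelled handling):
    
    ```python
    from time import sleep
    
    def wait_for_crawler_completion(self, crawler_name: str, poll_interval: int = 5) -> str:
        """Sketch of the hook method: poll get_crawler() until the crawler is READY."""
        while True:
            crawler = self.get_crawler(crawler_name)
            crawler_state = crawler['Crawler']['State']
            if crawler_state == 'READY':
                crawler_status = crawler['Crawler']['LastCrawl']['Status']
                self.log.info("Status: %s", crawler_status)
                return crawler_status
            sleep(poll_interval)
    ```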




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-748502329


   [The Workflow run](https://github.com/apache/airflow/actions/runs/432696908) is cancelling this PR. Building image for the PR has been cancelled


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r548396143



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,181 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> str:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=role_name)
+            self.log.info("Iam Role Name: %s", role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_or_create_crawler(self, config) -> str:
+        """
+        Creates the crawler if the crawler doesn't exist and returns the crawler name
+
+        :param config = Configurations for the AWS Glue crawler
+        :type config = Optional[dict]
+        :return: Name of the crawler
+        """
+        self.get_iam_execution_role(config["Role"])
+
+        try:
+            self.glue_client.get_crawler(**config)
+            self.log.info("Crawler already exists")
+            try:
+                self.glue_client.update_crawler(**config)
+                return config["Name"]
+            except Exception as general_error:
+                self.log.error("Failed to update aws glue crawler, error: %s", general_error)
+                raise
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler")
+            try:
+                self.glue_client.create_crawler(**config)
+                return config["Name"]
+            except Exception as general_error:
+                self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+                raise

Review comment:
       The nested try/excepts get a bit hard to decipher.
   
   ```suggestion
           crawler_name = config["Name"]
           try:
               self.glue_client.get_crawler(Name=crawler_name)
               self.log.info(f"Crawler '{crawler_name}' already exists; updating crawler.")
               self.glue_client.update_crawler(**config)
           except self.glue_client.exceptions.EntityNotFoundException:
               self.log.info("Creating AWS Glue crawler")
               self.glue_client.create_crawler(**config)
           return crawler_name
   ```
   
   I think it's better to remove the try except when you are just catching and immediately raising because it impairs readability without benefit --- it will be clear enough in the logs that the operation failed.

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,181 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> str:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=role_name)
+            self.log.info("Iam Role Name: %s", role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)

Review comment:
       the message is not correct for this method... but also ... to be a broken record :) ... I'm not sure it's necessary to catch here

##########
File path: tests/providers/amazon/aws/hooks/test_glue_crawler.py
##########
@@ -0,0 +1,142 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import json
+import unittest
+from unittest import mock
+
+from cached_property import cached_property
+
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+
+try:
+    from moto import mock_iam
+except ImportError:
+    mock_iam = None
+
+mock_crawler_name = 'test-crawler'
+mock_role_name = 'test-role'
+mock_config = {
+    'Name': mock_crawler_name,
+    'Description': 'Test glue crawler from Airflow',
+    'DatabaseName': 'test_db',
+    'Role': mock_role_name,
+    'Targets': {
+        'S3Targets': [
+            {
+                'Path': 's3://test-glue-crawler/foo/',
+                'Exclusions': [
+                    's3://test-glue-crawler/bar/',
+                ],
+                'ConnectionName': 'test-s3-conn',
+            }
+        ],
+        'JdbcTargets': [
+            {
+                'ConnectionName': 'test-jdbc-conn',
+                'Path': 'test_db/test_table>',
+                'Exclusions': [
+                    'string',
+                ],
+            }
+        ],
+        'MongoDBTargets': [
+            {'ConnectionName': 'test-mongo-conn', 'Path': 'test_db/test_collection', 'ScanAll': True}
+        ],
+        'DynamoDBTargets': [{'Path': 'test_db/test_table', 'scanAll': True, 'scanRate': 123.0}],
+        'CatalogTargets': [
+            {
+                'DatabaseName': 'test_glue_db',
+                'Tables': [
+                    'test',
+                ],
+            }
+        ],
+    },
+    'Classifiers': ['test-classifier'],
+    'TablePrefix': 'test',
+    'SchemaChangePolicy': {
+        'UpdateBehavior': 'UPDATE_IN_DATABASE',
+        'DeleteBehavior': 'DEPRECATE_IN_DATABASE',
+    },
+    'RecrawlPolicy': {'RecrawlBehavior': 'CRAWL_EVERYTHING'},
+    'LineageConfiguration': 'ENABLE',
+    'Configuration': """
+    {
+        "Version": 1.0,
+        "CrawlerOutput": {
+            "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
+        }
+    }
+    """,
+    'SecurityConfiguration': 'test',
+    'Tags': {'test': 'foo'},
+}
+
+
+class TestAwsGlueCrawlerHook(unittest.TestCase):
+    @cached_property
+    def setUp(self):
+        self.hook = AwsGlueCrawlerHook(aws_conn_id="aws_default", poll_interval=5)
+
+    @unittest.skipIf(mock_iam is None, 'mock_iam package not present')
+    @mock_iam
+    def test_get_iam_execution_role(self):
+        iam_role = self.hook.get_client_type('iam').create_role(
+            Path="/",
+            RoleName=mock_role_name,
+            AssumeRolePolicyDocument=json.dumps(
+                {
+                    "Version": "2012-10-17",
+                    "Statement": {
+                        "Effect": "Allow",
+                        "Principal": {"Service": "glue.amazonaws.com"},
+                        "Action": "sts:AssumeRole",
+                    },
+                }
+            ),
+        )
+        iam_role = self.hook.get_iam_execution_role(role_name=mock_role_name)
+
+        self.assertIsNotNone(iam_role)
+
+    @mock.patch.object(AwsGlueCrawlerHook, "get_iam_execution_role")
+    @mock.patch.object(AwsGlueCrawlerHook, "get_conn")
+    def test_get_or_create_crawler(self, mock_get_conn, mock_get_iam_execution_role):
+        mock_get_iam_execution_role.return_value = mock.MagicMock(Role={'RoleName': mock_role_name})

Review comment:
       does this have any effect?

##########
File path: tests/providers/amazon/aws/hooks/test_glue_crawler.py
##########
@@ -0,0 +1,142 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import json
+import unittest
+from unittest import mock
+
+from cached_property import cached_property
+
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+
+try:
+    from moto import mock_iam
+except ImportError:
+    mock_iam = None
+
+mock_crawler_name = 'test-crawler'
+mock_role_name = 'test-role'
+mock_config = {
+    'Name': mock_crawler_name,
+    'Description': 'Test glue crawler from Airflow',
+    'DatabaseName': 'test_db',
+    'Role': mock_role_name,
+    'Targets': {
+        'S3Targets': [
+            {
+                'Path': 's3://test-glue-crawler/foo/',
+                'Exclusions': [
+                    's3://test-glue-crawler/bar/',
+                ],
+                'ConnectionName': 'test-s3-conn',
+            }
+        ],
+        'JdbcTargets': [
+            {
+                'ConnectionName': 'test-jdbc-conn',
+                'Path': 'test_db/test_table>',
+                'Exclusions': [
+                    'string',
+                ],
+            }
+        ],
+        'MongoDBTargets': [
+            {'ConnectionName': 'test-mongo-conn', 'Path': 'test_db/test_collection', 'ScanAll': True}
+        ],
+        'DynamoDBTargets': [{'Path': 'test_db/test_table', 'scanAll': True, 'scanRate': 123.0}],
+        'CatalogTargets': [
+            {
+                'DatabaseName': 'test_glue_db',
+                'Tables': [
+                    'test',
+                ],
+            }
+        ],
+    },
+    'Classifiers': ['test-classifier'],
+    'TablePrefix': 'test',
+    'SchemaChangePolicy': {
+        'UpdateBehavior': 'UPDATE_IN_DATABASE',
+        'DeleteBehavior': 'DEPRECATE_IN_DATABASE',
+    },
+    'RecrawlPolicy': {'RecrawlBehavior': 'CRAWL_EVERYTHING'},
+    'LineageConfiguration': 'ENABLE',
+    'Configuration': """
+    {
+        "Version": 1.0,
+        "CrawlerOutput": {
+            "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
+        }
+    }
+    """,
+    'SecurityConfiguration': 'test',
+    'Tags': {'test': 'foo'},
+}
+
+
+class TestAwsGlueCrawlerHook(unittest.TestCase):
+    @cached_property
+    def setUp(self):
+        self.hook = AwsGlueCrawlerHook(aws_conn_id="aws_default", poll_interval=5)
+
+    @unittest.skipIf(mock_iam is None, 'mock_iam package not present')
+    @mock_iam
+    def test_get_iam_execution_role(self):
+        iam_role = self.hook.get_client_type('iam').create_role(
+            Path="/",
+            RoleName=mock_role_name,
+            AssumeRolePolicyDocument=json.dumps(
+                {
+                    "Version": "2012-10-17",
+                    "Statement": {
+                        "Effect": "Allow",
+                        "Principal": {"Service": "glue.amazonaws.com"},
+                        "Action": "sts:AssumeRole",
+                    },
+                }
+            ),
+        )
+        iam_role = self.hook.get_iam_execution_role(role_name=mock_role_name)
+
+        self.assertIsNotNone(iam_role)

Review comment:
       perhaps it would be just as easy to verify that the role you mocked is the role retrieved (which is slightly better than just confirming that the value is not None)
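
    e.g. (IAM's `get_role` returns the role under the `Role` key):

    ```
    iam_role = self.hook.get_iam_execution_role(role_name=mock_role_name)
    self.assertEqual(iam_role['Role']['RoleName'], mock_role_name)
    ```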

##########
File path: airflow/providers/amazon/aws/operators/glue_crawler.py
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from cached_property import cached_property
+
+from airflow.models import BaseOperator
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerOperator(BaseOperator):
+    """
+    Creates, updates and triggers an AWS Glue Crawler. AWS Glue Crawler is a serverless
+    service that manages a catalog of metadata tables that contain the inferred
+    schema, format and data types of data stores within the AWS cloud.
+
+    :param aws_conn_id: aws connection to use
+    :type aws_conn_id: str
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]

Review comment:
       not optional

##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,181 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> str:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=role_name)
+            self.log.info("Iam Role Name: %s", role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_or_create_crawler(self, config) -> str:
+        """
+        Creates the crawler if the crawler doesn't exists and returns the crawler name
+
+        :param config = Configurations for the AWS Glue crawler
+        :type config = Optional[dict]
+        :return: Name of the crawler
+        """
+        self.get_iam_execution_role(config["Role"])
+
+        try:
+            self.glue_client.get_crawler(**config)
+            self.log.info("Crawler already exists")
+            try:
+                self.glue_client.update_crawler(**config)
+                return config["Name"]
+            except Exception as general_error:
+                self.log.error("Failed to update aws glue crawler, error: %s", general_error)
+                raise
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler")
+            try:
+                self.glue_client.create_crawler(**config)
+                return config["Name"]
+            except Exception as general_error:
+                self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+                raise
+
+    def start_crawler(self, crawler_name: str) -> str:
+        """
+        Triggers the AWS Glue crawler
+        :return: Empty dictionary
+        """
+        try:
+            crawler_run = self.glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:

Review comment:
       again, I'm not sure it's worth catching only to re-raise... are the exceptions you get from the API not clear enough on their own?
   

##########
File path: airflow/providers/amazon/aws/operators/glue_crawler.py
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from cached_property import cached_property
+
+from airflow.models import BaseOperator
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerOperator(BaseOperator):
+    """
+    Creates, updates and triggers an AWS Glue Crawler. AWS Glue Crawler is a serverless
+    service that manages a catalog of metadata tables that contain the inferred
+    schema, format and data types of data stores within the AWS cloud.
+
+    :param aws_conn_id: aws connection to use
+    :type aws_conn_id: str
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    template_fields = ()

Review comment:
       if you aren't changing these you can just inherit them from `BaseOperator`.
   but does it make sense to template the `config` param?
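
    if templating `config` is useful, a one-line sketch (Airflow should render nested fields of a templated dict):

    ```
    template_fields = ('config',)
    ```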

##########
File path: tests/providers/amazon/aws/hooks/test_glue_crawler.py
##########
@@ -0,0 +1,142 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import json
+import unittest
+from unittest import mock
+
+from cached_property import cached_property
+
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+
+try:
+    from moto import mock_iam
+except ImportError:
+    mock_iam = None
+
+mock_crawler_name = 'test-crawler'
+mock_role_name = 'test-role'
+mock_config = {
+    'Name': mock_crawler_name,
+    'Description': 'Test glue crawler from Airflow',
+    'DatabaseName': 'test_db',
+    'Role': mock_role_name,
+    'Targets': {
+        'S3Targets': [
+            {
+                'Path': 's3://test-glue-crawler/foo/',
+                'Exclusions': [
+                    's3://test-glue-crawler/bar/',
+                ],
+                'ConnectionName': 'test-s3-conn',
+            }
+        ],
+        'JdbcTargets': [
+            {
+                'ConnectionName': 'test-jdbc-conn',
+                'Path': 'test_db/test_table>',
+                'Exclusions': [
+                    'string',
+                ],
+            }
+        ],
+        'MongoDBTargets': [
+            {'ConnectionName': 'test-mongo-conn', 'Path': 'test_db/test_collection', 'ScanAll': True}
+        ],
+        'DynamoDBTargets': [{'Path': 'test_db/test_table', 'scanAll': True, 'scanRate': 123.0}],
+        'CatalogTargets': [
+            {
+                'DatabaseName': 'test_glue_db',
+                'Tables': [
+                    'test',
+                ],
+            }
+        ],
+    },
+    'Classifiers': ['test-classifier'],
+    'TablePrefix': 'test',
+    'SchemaChangePolicy': {
+        'UpdateBehavior': 'UPDATE_IN_DATABASE',
+        'DeleteBehavior': 'DEPRECATE_IN_DATABASE',
+    },
+    'RecrawlPolicy': {'RecrawlBehavior': 'CRAWL_EVERYTHING'},
+    'LineageConfiguration': 'ENABLE',
+    'Configuration': """
+    {
+        "Version": 1.0,
+        "CrawlerOutput": {
+            "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
+        }
+    }
+    """,
+    'SecurityConfiguration': 'test',
+    'Tags': {'test': 'foo'},
+}
+
+
+class TestAwsGlueCrawlerHook(unittest.TestCase):
+    @cached_property
+    def setUp(self):
+        self.hook = AwsGlueCrawlerHook(aws_conn_id="aws_default", poll_interval=5)
+
+    @unittest.skipIf(mock_iam is None, 'mock_iam package not present')
+    @mock_iam
+    def test_get_iam_execution_role(self):
+        iam_role = self.hook.get_client_type('iam').create_role(
+            Path="/",
+            RoleName=mock_role_name,
+            AssumeRolePolicyDocument=json.dumps(
+                {
+                    "Version": "2012-10-17",
+                    "Statement": {
+                        "Effect": "Allow",
+                        "Principal": {"Service": "glue.amazonaws.com"},
+                        "Action": "sts:AssumeRole",
+                    },
+                }
+            ),
+        )
+        iam_role = self.hook.get_iam_execution_role(role_name=mock_role_name)
+
+        self.assertIsNotNone(iam_role)
+
+    @mock.patch.object(AwsGlueCrawlerHook, "get_iam_execution_role")
+    @mock.patch.object(AwsGlueCrawlerHook, "get_conn")
+    def test_get_or_create_crawler(self, mock_get_conn, mock_get_iam_execution_role):
+        mock_get_iam_execution_role.return_value = mock.MagicMock(Role={'RoleName': mock_role_name})
+
+        mock_glue_crawler = mock_get_conn.return_value.get_crawler()['Crawler']['Name']

Review comment:
       I think what you are trying to do is mock `get_crawler` to return some value, and then confirm that `get_or_create_crawler` returns this same value. But here you are not correctly mocking a return value; you are just setting `mock_glue_crawler` equal to a mock object, `<MagicMock name='get_conn().get_crawler()[41 chars]936'>`
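
    e.g., to actually stub the client's response and assert the hook surfaces it (a sketch; `mock_config` and `mock_crawler_name` as defined in this test):

    ```
    mock_get_conn.return_value.get_crawler.return_value = {'Crawler': {'Name': mock_crawler_name}}

    crawler_name = self.hook.get_or_create_crawler(mock_config)
    self.assertEqual(crawler_name, mock_crawler_name)
    ```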

##########
File path: airflow/providers/amazon/aws/operators/glue_crawler.py
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from cached_property import cached_property
+
+from airflow.models import BaseOperator
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerOperator(BaseOperator):
+    """
+    Creates, updates and triggers an AWS Glue Crawler. AWS Glue Crawler is a serverless
+    service that manages a catalog of metadata tables that contain the inferred
+    schema, format and data types of data stores within the AWS cloud.
+
+    :param aws_conn_id: aws connection to use
+    :type aws_conn_id: str
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    template_fields = ()
+    template_ext = ()
+    ui_color = '#ededed'
+
+    @apply_defaults
+    def __init__(
+        self,
+        config,
+        aws_conn_id='aws_default',
+        poll_interval: int = 5,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.aws_conn_id = (aws_conn_id,)

Review comment:
       should be a string, not a tuple (the trailing comma makes this a 1-tuple)
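
    i.e.:

    ```
    self.aws_conn_id = aws_conn_id
    ```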

##########
File path: airflow/providers/amazon/aws/sensors/glue_crawler.py
##########
@@ -0,0 +1,54 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.sensors.base import BaseSensorOperator
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerSensor(BaseSensorOperator):
+    """
+    Waits for an AWS Glue crawler to reach any of the statuses below
+    'FAILED', 'CANCELLED', 'SUCCEEDED'
+    :param crawler_name: The AWS Glue crawler unique name
+    :type crawler_name: str
+    """
+
+    template_fields = 'crawler_name'
+
+    @apply_defaults
+    def __init__(self, *, crawler_name: str, aws_conn_id: str = 'aws_default', **kwargs):
+        super().__init__(**kwargs)
+        self.crawler_name = crawler_name
+        self.aws_conn_id = aws_conn_id
+        self.success_statuses = ['SUCCEEDED']
+        self.errored_statuses = ['FAILED', 'CANCELLED']

Review comment:
       these probably make more sense as class constants
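
    e.g.:

    ```
    class AwsGlueCrawlerSensor(BaseSensorOperator):
        SUCCESS_STATUSES = ('SUCCEEDED',)
        ERRORED_STATUSES = ('FAILED', 'CANCELLED')
    ```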

##########
File path: tests/providers/amazon/aws/sensors/test_glue_crawler.py
##########
@@ -0,0 +1,60 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+from unittest import mock
+
+from airflow import configuration
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.providers.amazon.aws.sensors.glue_crawler import AwsGlueCrawlerSensor
+
+
+class TestAwsGlueCrawlerSensor(unittest.TestCase):
+    def setUp(self):
+        configuration.load_test_config()
+
+    @mock.patch.object(AwsGlueCrawlerHook, 'get_conn')
+    @mock.patch.object(AwsGlueCrawlerHook, 'get_crawler_state')
+    def test_poke(self, mock_get_crawler_state, mock_conn):

Review comment:
       you could consider combining these into one test, parameterized over the various statuses and their expected outcomes
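
    e.g., a rough sketch with the `parameterized` package (names and expected outcomes are illustrative; failed statuses may need an `assertRaises` variant instead):

    ```
    from parameterized import parameterized

    @parameterized.expand(
        [
            ('SUCCEEDED', True),
            ('RUNNING', False),
        ]
    )
    @mock.patch.object(AwsGlueCrawlerHook, 'get_crawler_state')
    @mock.patch.object(AwsGlueCrawlerHook, 'get_conn')
    def test_poke(self, state, expected, mock_conn, mock_get_crawler_state):
        mock_get_crawler_state.return_value = state
        # `self.sensor` assumed to be built in setUp
        self.assertEqual(self.sensor.poke(None), expected)
    ```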




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r543723940



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,282 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List, Optional
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param crawler_name = Unique crawler name per AWS Account
+    :type crawler_name = Optional[str]
+    :param crawler_desc = Crawler description
+    :type crawler_desc = Optional[str]
+    :param glue_db_name = AWS glue catalog database ID
+    :type glue_db_name = Optional[str]
+    :param iam_role_name = AWS IAM role for glue crawler
+    :type iam_role_name = Optional[str]
+    :param region_name = AWS region name (e.g. 'us-west-2')
+    :type region_name = Optional[str]
+    :param s3_target_configuration = Configurations for crawling AWS S3 paths
+    :type s3_target_configuration = Optional[list]
+    :param jdbc_target_configuration = Configurations for crawling JDBC paths
+    :type jdbc_target_configuration = Optional[list]
+    :param mongo_target_configuration = Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration = Optional[list]
+    :param dynamo_target_configuration = Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration = Optional[list]
+    :param glue_catalog_target_configuration = Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration = Optional[list]
+    :param cron_schedule = Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule = Optional[str]
+    :param classifiers = List of user defined custom classifiers to be used by the crawler
+    :type classifiers = Optional[list]
+    :param table_prefix = Prefix for catalog table to be created
+    :type table_prefix = Optional[str]
+    :param update_behavior = Behavior when the crawler identifies schema changes
+    :type update_behavior = Optional[str]
+    :param delete_behavior = Behavior when the crawler identifies deleted objects
+    :type delete_behavior = Optional[str]
+    :param recrawl_behavior = Behavior when the crawler needs to crawl again
+    :type recrawl_behavior = Optional[str]
+    :param lineage_settings = Enables or disables data lineage
+    :type lineage_settings = Optional[str]
+    :param json_configuration = Versioned JSON configuration for the crawler
+    :type json_configuration = Optional[str]
+    :param security_configuration = Name of the security configuration structure to be used by the crawler.
+    :type security_configuration = Optional[str]
+    :param tags = Tags to attach to the crawler request
+    :type tags = Optional[dict]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        ":return: Lists of Crawlers"
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        ":return: iam role for crawler execution"
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue to run crawler
+        :return:
+        """
+        glue_client = self.get_conn()
+
+        try:
+            crawler_name = self.get_or_create_glue_crawler()
+            crawler_run = glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:
+            self.log.error("Failed to run aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_state = crawler_run['Crawler']['State']
+        return crawler_run_state
+
+    def get_crawler_status(self, crawler_name: str) -> str:
+        """
+        Get current status of the Glue crawler. The crawler
+        status can be succeeded, cancelled, or failed.
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Status of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_status = crawler_run['Crawler']['LastCrawl']['Status']
+        return crawler_run_status
+
+    def crawler_completion(self, crawler_name: str) -> str:
+        """
+        Waits until Glue crawler with crawler_name completes or
+        fails and returns final state if finished.
+        Raises AirflowException when the crawler failed
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Dict of crawler's status
+        """
+        failed_status = ['FAILED', 'CANCELLED']
+
+        while True:
+            crawler_run_state = self.get_crawler_state(crawler_name)
+            if crawler_run_state == 'READY':
+                self.log.info("Crawler: %s State: %s", crawler_name, crawler_run_state)
+                crawler_run_status = self.get_crawler_status(crawler_name)
+                if crawler_run_status in failed_status:
+                    crawler_error_message = (
+                        "Exiting Crawler: " + crawler_name + " Run State: " + crawler_run_state
+                    )
+                    self.log.info(crawler_error_message)
+                    raise AirflowException(crawler_error_message)
+                else:
+                    self.log.info("Crawler Status: %s", crawler_run_status)
+                    metrics = self.get_crawler_metrics(self.crawler_name)
+                    print('Last Runtime Duration (seconds): ', metrics['LastRuntimeSeconds'])
+                    print('Median Runtime Duration (seconds): ', metrics['MedianRuntimeSeconds'])
+                    print('Tables Created: ', metrics['TablesCreated'])
+                    print('Tables Updated: ', metrics['TablesUpdated'])
+                    print('Tables Deleted: ', metrics['TablesDeleted'])
+                    
+                    return {'Status': crawler_run_status}
+                    
+            else:
+                self.log.info(
+                    "Polling for AWS Glue crawler: %s Current run state: %s",
+                    self.crawler_name,
+                    crawler_run_state,
+                )
+                time.sleep(self.CRAWLER_POLL_INTERVAL)
+
+                metrics = self.get_crawler_metrics(self.crawler_name)
+                time_left = int(metrics['TimeLeftSeconds'])
+                
+                if time_left > 0:
+                    print('Estimated Time Left (seconds): ', time_left)
+                    self.CRAWLER_POLL_INTERVAL = time_left
+                else:
+                    print('Crawler should finish soon')
+
+    def get_or_create_glue_crawler(self) -> str:
+        """
+        Creates the crawler if the crawler doesn't exists and returns the crawler name
+        :return:Name of the crawler
+        """
+        glue_client = self.get_conn()
+        try:
+            get_crawler_response = glue_client.get_crawler(Name=self.crawler_name)
+            self.log.info("Crawler already exists: %s", get_crawler_response['Crawler']['Name'])
+            return get_crawler_response['Crawler']['Name']
+            # TODO: update crawler with `glue_client.update_crawler()` if task crawler config don't match with existing crawler config
+
+        except glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Crawler doesn't exist. Creating AWS Glue crawler")
+            execution_role = self.get_iam_execution_role()
+            try:
+                create_crawler_response = glue_client.create_crawler(
+                    Name=self.crawler_name,
+                    Role=execution_role['Role']['RoleName'],
+                    DatabaseName=self.glue_db_name,
+                    Description=self.crawler_desc,
+                    Targets={
+                        'S3Targets': self.s3_targets_configuration,
+                        'JdbcTargets': self.jdbc_targets_configuration,
+                        'MongoDBTargets': self.mongo_targets_configuration,
+                        'DynamoDBTargets': self.dynamo_targets_configuration,
+                        'CatalogTargets': self.glue_catalog_targets_configuration,
+                    },
+                    Schedule=self.cron_schedule,
+                    Classifiers=self.classifiers,
+                    TablePrefix=self.table_prefix,
+                    SchemaChangePolicy={
+                        'UpdateBehavior': self.update_behavior,
+                        'DeleteBehavior': self.delete_behavior,
+                    },
+                    RecrawlPolicy={'RecrawlBehavior': self.recrawl_behavior},
+                    LineageConfiguration={'CrawlerLineageSettings': self.lineage_settings},
+                    Configuration=self.json_configuration,
+                    CrawlerSecurityConfiguration=self.security_configuration,
+                    Tags=self.tags,
+                )
+                return self.crawler_name
+            except Exception as general_error:
+                self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+                raise
+
+    def get_crawler_metrics(self, crawler_name):
+        """
+        Prints crawl runtime and glue catalog table metrics associated with the crawler
+        :return:Dictionary of all the crawler metrics

Review comment:
       ```suggestion
   
           :return: Dictionary of all the crawler metrics
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560295521



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs = Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs = any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])
+            raise AirflowException(general_error)
+
+    def start_crawler(self, crawler_name: str) -> dict:
+        """
+        Triggers the AWS Glue crawler
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Empty dictionary
+        """
+        crawler = self.glue_client.start_crawler(Name=crawler_name)
+        return crawler
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        crawler = self.glue_client.get_crawler(Name=crawler_name)
+        crawler_state = crawler['Crawler']['State']

Review comment:
       maybe remove `get_crawler_status()` and `get_crawler_state()`, add a `get_crawler()` function that returns the result of `glue_client.get_crawler()`, and then just index into the response to get the state/status within `wait_for_crawler_completion()`?

    something like this within `wait_for_crawler_completion()`:

    ```
    while True:
        crawler = self.get_crawler(crawler_name)
        crawler_state = crawler['Crawler']['State']
        if crawler_state == 'READY':
            self.log.info("State: %s", crawler_state)
            crawler_status = crawler['Crawler']['LastCrawl']['Status']
    ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r549787437



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,181 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: int
+    :param config = Configurations for the AWS Glue crawler
+    :type config = Optional[dict]
+    """
+
+    def __init__(self, poll_interval, *args, **kwargs):
+        self.poll_interval = poll_interval
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def get_iam_execution_role(self, role_name) -> str:
+        """:return: iam role for crawler execution"""
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=role_name)
+            self.log.info("Iam Role Name: %s", role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_or_create_crawler(self, config) -> str:
+        """
+        Creates the crawler if the crawler doesn't exists and returns the crawler name
+
+        :param config = Configurations for the AWS Glue crawler
+        :type config = Optional[dict]
+        :return: Name of the crawler
+        """
+        self.get_iam_execution_role(config["Role"])

Review comment:
       personally, I would get rid of this method, and just let it fail at get_crawler. The outcome is the same whether you check role existence first or skip the check: the job fails, with a message about the role.

    including the method suggests that it needs to be there, so it will be unnecessarily confusing.

    if you just want to translate an error message, you could catch the error in the appropriate spot and translate it.

    but I don't think that's necessary, because the error already gives enough info.

    but that's just me :)
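
    i.e., if a friendlier message is the goal, a sketch of translating it at the create call (wording is hypothetical):

    ```
    try:
        self.glue_client.create_crawler(**config)
    except self.glue_client.exceptions.InvalidInputException as error:
        raise AirflowException(f"Failed to create crawler, check that role '{config['Role']}' exists: {error}")
    ```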




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r556723139



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        non-existing role as a role trust policy error.
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if the crawler doesn't exists and returns the crawler name
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler %s", crawler_name)
+            try:
+                glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+                return glue_response['Crawler']['Name']
+            except self.glue_client.exceptions.InvalidInputException as general_error:
+                self.check_iam_role(crawler_kwargs['Role'])
+                raise AirflowException(general_error)

Review comment:
       Maybe the operator should use an if/else, since the `get_crawler()` function will also call the Glue API's `update_crawler()`, and it would be unnecessary to call `update_crawler()` right after `create_crawler()`.

    For example:
    ```
    if not self.hook.has_crawler(self.config['Name']):
        crawler_name = self.hook.create_crawler(**self.config)
    else:
        crawler_name = self.hook.get_crawler(**self.config)
    ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560319861



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       I am not sure why you would want to check the IAM role if you're going to raise anyway?







[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r543723690



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,282 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List, Optional
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param crawler_name = Unique crawler name per AWS Account
+    :type crawler_name = Optional[str]
+    :param crawler_desc = Crawler description
+    :type crawler_desc = Optional[str]
+    :param glue_db_name = AWS glue catalog database ID
+    :type glue_db_name = Optional[str]
+    :param iam_role_name = AWS IAM role for glue crawler
+    :type iam_role_name = Optional[str]
+    :param region_name = AWS region name (e.g. 'us-west-2')
+    :type region_name = Optional[str]
+    :param s3_target_configuration = Configurations for crawling AWS S3 paths
+    :type s3_target_configuration = Optional[list]
+    :param jdbc_target_configuration = Configurations for crawling JDBC paths
+    :type jdbc_target_configuration = Optional[list]
+    :param mongo_target_configuration = Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_target_configuration = Optional[list]
+    :param dynamo_target_configuration = Configurations for crawling AWS DynamoDB
+    :type dynamo_target_configuration = Optional[list]
+    :param glue_catalog_target_configuration = Configurations for crawling AWS Glue CatalogDB
+    :type glue_catalog_target_configuration = Optional[list]
+    :param cron_schedule = Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule = Optional[str]
+    :param classifiers = List of user defined custom classifiers to be used by the crawler
+    :type classifiers = Optional[list]
+    :param table_prefix = Prefix for catalog table to be created
+    :type table_prefix = Optional[str]
+    :param update_behavior = Behavior when the crawler identifies schema changes
+    :type update_behavior = Optional[str]
+    :param delete_behavior = Behavior when the crawler identifies deleted objects
+    :type delete_behavior = Optional[str]
+    :param recrawl_behavior = Behavior when the crawler needs to crawl again
+    :type recrawl_behavior = Optional[str]
+    :param lineage_settings = Enables or disables data lineage
+    :type lineage_settings = Optional[str]
+    :param json_configuration = Versioned JSON configuration for the crawler
+    :type json_configuration = Optional[str]
+    :param security_configuration = Name of the security configuration structure to be used by the crawler.
+    :type security_configuration = Optional[str]
+    :param tags = Tags to attach to the crawler request
+    :type tags = Optional[dict]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        ":return: Lists of Crawlers"
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        ":return: iam role for crawler execution"
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue to run crawler
+        :return:
+        """
+        glue_client = self.get_conn()
+
+        try:
+            crawler_name = self.get_or_create_glue_crawler()
+            crawler_run = glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:
+            self.log.error("Failed to run aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_state = crawler_run['Crawler']['State']
+        return crawler_run_state
+
+    def get_crawler_status(self, crawler_name: str) -> str:
+        """
+        Get current status of the Glue crawler. The crawler
+        status can be succeeded, cancelled, or failed.
+        :param crawler_name: unique crawler name per AWS account

Review comment:
       ```suggestion
   
           :param crawler_name: unique crawler name per AWS account
   ```







[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560486568



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       As mentioned in the `check_iam_role()` docs, the Glue API's `create_crawler()` function doesn't catch non-existing roles properly and throws an inaccurate trust policy error. So the purpose of `check_iam_role()` is to throw the IAM client's non-existing role error instead of the inaccurate trust policy error.
    
    But your comment does remind me that I should change `check_iam_role()` to throw an AirflowException with the non-existing role error, so that the `AirflowException(general_error)` can be omitted.
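    
    Roughly what I have in mind (untested sketch; `NoSuchEntityException` is the IAM client error I'd expect here):
    
    ```
    def check_iam_role(self, role_name: str) -> None:
        """Raise an AirflowException if the role does not exist in the caller's account."""
        iam_client = self.get_client_type('iam', self.region_name)
        try:
            iam_client.get_role(RoleName=role_name)
        except iam_client.exceptions.NoSuchEntityException as role_error:
            raise AirflowException(role_error)
    ```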







[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-748502241


   [The Workflow run](https://github.com/apache/airflow/actions/runs/432715188) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.





[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560149373



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs = Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs = any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])
+            raise AirflowException(general_error)
+
+    def start_crawler(self, crawler_name: str) -> dict:
+        """
+        Triggers the AWS Glue crawler
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Empty dictionary
+        """
+        crawler = self.glue_client.start_crawler(Name=crawler_name)
+        return crawler
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        crawler = self.glue_client.get_crawler(Name=crawler_name)
+        crawler_state = crawler['Crawler']['State']

Review comment:
       Are we sure we need separate methods for one field? Can't we generalize it more? During one iteration of the loop within the `wait_for_crawler_completion` function, the crawler is fetched 3 times. That's quite a lot.
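    
    For example, a single method could return the whole crawler description and let callers index into it (just a sketch):
    
    ```
    def get_crawler(self, crawler_name: str) -> dict:
        """Return the full crawler description (state, last crawl status, etc.)."""
        return self.glue_client.get_crawler(Name=crawler_name)['Crawler']
    ```
    
    Then `wait_for_crawler_completion` would only need one request per loop iteration.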







[GitHub] [airflow] feluelle commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
feluelle commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560744852



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs: Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs: any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])

Review comment:
       > in testing we saw that when InvalidInputException is raised due to role, it mentions the role in the error. so it seems pointless to me.
   
    Got it. If that's really the case then we do not need, and probably should not, check the IAM role again. We save a request :) It only makes sense if the error does not mention the role.
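    
    Something along these lines, maybe (hypothetical sketch; the `'Role' in str(...)` check is an assumption about the error text):
    
    ```
    def create_crawler(self, **crawler_kwargs) -> str:
        """Create the crawler, falling back to an explicit IAM lookup only
        when the Glue error does not already name the role."""
        try:
            return self.glue_client.create_crawler(**crawler_kwargs)['Crawler']['Name']
        except self.glue_client.exceptions.InvalidInputException as general_error:
            if 'Role' not in str(general_error):  # assumption about the error text
                self.check_iam_role(crawler_kwargs['Role'])
            raise AirflowException(general_error)
    ```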







[GitHub] [airflow] mschmo commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mschmo commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r550318831



##########
File path: airflow/providers/amazon/aws/operators/glue_crawler.py
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from cached_property import cached_property
+
+from airflow.models import BaseOperator
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+from airflow.utils.decorators import apply_defaults
+
+
+class AwsGlueCrawlerOperator(BaseOperator):
+    """
+    Creates, updates and triggers an AWS Glue Crawler. AWS Glue Crawler is a serverless
+    service that manages a catalog of metadata tables that contain the inferred
+    schema, format and data types of data stores within the AWS cloud.
+
+    :param config: Configurations for the AWS Glue crawler
+    :type config: dict
+    :param aws_conn_id: aws connection to use
+    :type aws_conn_id: Optional[str]
+    :param poll_interval: Time (in seconds) to wait between two consecutive calls to check crawler status
+    :type poll_interval: Optional[int]
+    """
+
+    ui_color = '#ededed'
+
+    @apply_defaults
+    def __init__(
+        self,
+        config,
+        aws_conn_id='aws_default',
+        poll_interval: int = 5,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.aws_conn_id = aws_conn_id
+        self.poll_interval = poll_interval
+        self.config = config
+
+    @cached_property
+    def hook(self) -> AwsGlueCrawlerHook:
+        """Create and return an AwsGlueCrawlerHook."""
+        return AwsGlueCrawlerHook(self.aws_conn_id)

Review comment:
       And remember that `AwsGlueCrawlerHook`'s first argument at the moment is `poll_interval`, so this is actually a bug. But again, I think `poll_interval` should be removed from the `__init__` altogether.
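    
    Passing the connection id by keyword would sidestep the positional mix-up, e.g. (sketch):
    
    ```
    @cached_property
    def hook(self) -> AwsGlueCrawlerHook:
        """Create and return an AwsGlueCrawlerHook, binding the connection by keyword."""
        return AwsGlueCrawlerHook(aws_conn_id=self.aws_conn_id)
    ```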







[GitHub] [airflow] dstandish commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r548387952



##########
File path: tests/providers/amazon/aws/hooks/test_glue_crawler.py
##########
@@ -0,0 +1,142 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import json
+import unittest
+from unittest import mock
+
+from cached_property import cached_property
+
+from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook
+
+try:
+    from moto import mock_iam
+except ImportError:
+    mock_iam = None
+
+mock_crawler_name = 'test-crawler'
+mock_role_name = 'test-role'
+mock_config = {
+    'Name': mock_crawler_name,
+    'Description': 'Test glue crawler from Airflow',
+    'DatabaseName': 'test_db',
+    'Role': mock_role_name,
+    'Targets': {
+        'S3Targets': [
+            {
+                'Path': 's3://test-glue-crawler/foo/',
+                'Exclusions': [
+                    's3://test-glue-crawler/bar/',
+                ],
+                'ConnectionName': 'test-s3-conn',
+            }
+        ],
+        'JdbcTargets': [
+            {
+                'ConnectionName': 'test-jdbc-conn',
+                'Path': 'test_db/test_table>',
+                'Exclusions': [
+                    'string',
+                ],
+            }
+        ],
+        'MongoDBTargets': [
+            {'ConnectionName': 'test-mongo-conn', 'Path': 'test_db/test_collection', 'ScanAll': True}
+        ],
+        'DynamoDBTargets': [{'Path': 'test_db/test_table', 'scanAll': True, 'scanRate': 123.0}],
+        'CatalogTargets': [
+            {
+                'DatabaseName': 'test_glue_db',
+                'Tables': [
+                    'test',
+                ],
+            }
+        ],
+    },
+    'Classifiers': ['test-classifier'],
+    'TablePrefix': 'test',
+    'SchemaChangePolicy': {
+        'UpdateBehavior': 'UPDATE_IN_DATABASE',
+        'DeleteBehavior': 'DEPRECATE_IN_DATABASE',
+    },
+    'RecrawlPolicy': {'RecrawlBehavior': 'CRAWL_EVERYTHING'},
+    'LineageConfiguration': 'ENABLE',
+    'Configuration': """
+    {
+        "Version": 1.0,
+        "CrawlerOutput": {
+            "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
+        }
+    }
+    """,
+    'SecurityConfiguration': 'test',
+    'Tags': {'test': 'foo'},
+}
+
+
+class TestAwsGlueCrawlerHook(unittest.TestCase):
+    @cached_property

Review comment:
       Isn't setUp a special method that is run at the start of each test? I am surprised this doesn't cause an error, because I assume it would try to call `setUp()`, but as it is a property this would not work.
    
    If you want it to run only once, you can implement `setUpClass` instead.
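    
    e.g. (sketch):
    
    ```
    class TestAwsGlueCrawlerHook(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            # runs once for the whole class, unlike setUp which runs before every test
            cls.hook = AwsGlueCrawlerHook(aws_conn_id='aws_default')
    ```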







[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-745618176


   [The Workflow run](https://github.com/apache/airflow/actions/runs/424330825) is cancelling this PR. Building image for the PR has been cancelled





[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r560295521



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,216 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler.
+
+    Additional arguments (such as ``aws_conn_id``) may be specified and
+    are passed down to the underlying AwsBaseHook.
+
+    .. seealso::
+        :class:`~airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook`
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        a non-existing role as a role trust policy error.
+
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def has_crawler(self, crawler_name) -> bool:
+        """
+        Checks if the crawler already exists
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Returns True if the crawler already exists and False if not.
+        """
+        self.log.info("Checking if AWS Glue crawler already exists: %s", crawler_name)
+
+        try:
+            self.glue_client.get_crawler(Name=crawler_name)
+            return True
+        except self.glue_client.exceptions.EntityNotFoundException:
+            return False
+
+    def update_crawler(self, **crawler_kwargs) -> str:
+        """
+        Updates crawler configurations
+
+        :param crawler_kwargs = Keyword args that define the configurations used for the crawler
+        :type crawler_kwargs = any
+        :return: True if crawler was updated and false otherwise
+        """
+        crawler_name = crawler_kwargs['Name']
+        current_crawler = self.glue_client.get_crawler(Name=crawler_name)['Crawler']
+
+        update_config = {
+            key: value for key, value in crawler_kwargs.items() if current_crawler[key] != crawler_kwargs[key]
+        }
+        if update_config != {}:
+            self.log.info("Updating crawler: %s", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            self.log.info("Updated configurations: %s", update_config)
+            return True
+        else:
+            return False
+
+    def create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates an AWS Glue Crawler
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        self.log.info("Creating AWS Glue crawler: %s", crawler_name)
+
+        try:
+            glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.InvalidInputException as general_error:
+            self.check_iam_role(crawler_kwargs['Role'])
+            raise AirflowException(general_error)
+
+    def start_crawler(self, crawler_name: str) -> dict:
+        """
+        Triggers the AWS Glue crawler
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Empty dictionary
+        """
+        crawler = self.glue_client.start_crawler(Name=crawler_name)
+        return crawler
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        crawler = self.glue_client.get_crawler(Name=crawler_name)
+        crawler_state = crawler['Crawler']['State']

Review comment:
       maybe consolidate `get_crawler_status()` and `get_crawler_state()` into one `get_crawler()` function that returns the result of `glue_client.get_crawler()` and then just index into the response to get the state/status within `wait_for_crawler_completion()`?
    
    something like this within `wait_for_crawler_completion()`:
    
    ```
    while True:
        crawler = self.get_crawler(crawler_name)
        crawler_state = crawler['Crawler']['State']
        if crawler_state == 'READY':
            self.log.info("State: %s", crawler_state)
            crawler_status = crawler['Crawler']['LastCrawl']['Status']
            if crawler_status in failed_status:
                raise AirflowException(
                    f"Status: {crawler_status}"
                )  # pylint: disable=raising-format-tuple
    ```
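    
    and the full method might then look something like this (rough sketch, not tested):
    
    ```
    def wait_for_crawler_completion(self, crawler_name: str, poll_interval: int = 5) -> str:
        """Wait until the crawler returns to READY, then raise if the last crawl failed."""
        failed_status = ['FAILED', 'CANCELLED']
        while True:
            crawler = self.get_crawler(crawler_name)
            crawler_state = crawler['Crawler']['State']
            if crawler_state == 'READY':
                self.log.info("State: %s", crawler_state)
                crawler_status = crawler['Crawler']['LastCrawl']['Status']
                if crawler_status in failed_status:
                    raise AirflowException(f"Status: {crawler_status}")
                self.log.info("Status: %s", crawler_status)
                return crawler_status
            self.log.info("Polling for AWS Glue crawler: %s", crawler_name)
            sleep(poll_interval)
    ```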







[GitHub] [airflow] github-actions[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-745525524


   [The Workflow run](https://github.com/apache/airflow/actions/runs/424009609) is cancelling this PR. Building image for the PR has been cancelled





[GitHub] [airflow] boring-cyborg[bot] commented on pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#issuecomment-766606970


   Awesome work, congrats on your first merged pull request!
   





[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r556716347



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        non-existing role as a role trust policy error.
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if the crawler doesn't exists and returns the crawler name
+
+        :param crawler_kwargs = Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs = any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler %s", crawler_name)
+            try:
+                glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+                return glue_response['Crawler']['Name']
+            except self.glue_client.exceptions.InvalidInputException as general_error:
+                self.check_iam_role(crawler_kwargs['Role'])
+                raise AirflowException(general_error)

Review comment:
       @feluelle That's a great idea! Seems like it will remove a future pain point with regard to the current `get_or_create_crawler()` function. Specifically, in a scenario where the user uses the function just to see if the crawler exists but accidentally makes a typo in the crawler_name and ends up creating a new crawler with the same configurations under a slightly different name.







[GitHub] [airflow] marshall7m commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
marshall7m commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r556723139



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+    :param config = Configurations for the AWS Glue crawler
+    :type config = dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> str:
+        """
+        Checks if the input IAM role name is a
+        valid pre-existing role within the caller's AWS account.
+        Is needed because the current Boto3 (<=1.16.46)
+        glue client create_crawler() method misleadingly catches
+        non-existing role as a role trust policy error.
+        :param role_name = IAM role name
+        :type role_name = str
+        :return: IAM role name
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if it doesn't exist and returns the crawler name
+
+        :param crawler_kwargs: Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs: any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)
+            return glue_response['Crawler']['Name']
+        except self.glue_client.exceptions.EntityNotFoundException:
+            self.log.info("Creating AWS Glue crawler %s", crawler_name)
+            try:
+                glue_response = self.glue_client.create_crawler(**crawler_kwargs)
+                return glue_response['Crawler']['Name']
+            except self.glue_client.exceptions.InvalidInputException as general_error:
+                self.check_iam_role(crawler_kwargs['Role'])
+                raise AirflowException(general_error)

Review comment:
       Maybe the operator should use an if/else, since the combined `get_crawler()` function would also have the Glue API's `update_crawler()` call in it, and it would be unnecessary to call `update_crawler()` right after running `create_crawler()`.
   
   For example:
   ```
   if not self.hook.has_crawler(**self.config):
       crawler_name = self.hook.create_crawler(**self.config)
   else:
       crawler_name = self.hook.get_crawler(**self.config)
   ```
   
   Update:
   
   Or create separate functions for `update_crawler()` and `get_crawler()` within the hook and call them separately in the operator:
   
   ```
   if not self.hook.has_crawler(**self.config):
       crawler_name = self.hook.create_crawler(**self.config)
   else:
       crawler_name = self.hook.get_crawler(**self.config)
       self.hook.update_crawler(**self.config)
   ```
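   
   A rough sketch of what the split hook methods could look like (the method names here are only proposals, not an existing API):
   
   ```
   def has_crawler(self, crawler_name: str) -> bool:
       """Returns True if a crawler with the given name already exists"""
       try:
           self.glue_client.get_crawler(Name=crawler_name)
           return True
       except self.glue_client.exceptions.EntityNotFoundException:
           return False
   
   def update_crawler(self, **crawler_kwargs) -> str:
       """Updates the existing crawler and returns its name"""
       self.glue_client.update_crawler(**crawler_kwargs)
       return crawler_kwargs['Name']
   ```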
    




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] feluelle merged pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
feluelle merged pull request #13072:
URL: https://github.com/apache/airflow/pull/13072


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r543723847



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,282 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+from typing import Dict, List, Optional
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param crawler_name: Unique crawler name per AWS account
+    :type crawler_name: Optional[str]
+    :param crawler_desc: Crawler description
+    :type crawler_desc: Optional[str]
+    :param glue_db_name: AWS Glue catalog database ID
+    :type glue_db_name: Optional[str]
+    :param iam_role_name: AWS IAM role for the Glue crawler
+    :type iam_role_name: Optional[str]
+    :param region_name: AWS region name (e.g. 'us-west-2')
+    :type region_name: Optional[str]
+    :param s3_targets_configuration: Configurations for crawling AWS S3 paths
+    :type s3_targets_configuration: Optional[list]
+    :param jdbc_targets_configuration: Configurations for crawling JDBC paths
+    :type jdbc_targets_configuration: Optional[list]
+    :param mongo_targets_configuration: Configurations for crawling AWS DocumentDB or MongoDB
+    :type mongo_targets_configuration: Optional[list]
+    :param dynamo_targets_configuration: Configurations for crawling AWS DynamoDB
+    :type dynamo_targets_configuration: Optional[list]
+    :param glue_catalog_targets_configuration: Configurations for crawling AWS Glue Data Catalog databases
+    :type glue_catalog_targets_configuration: Optional[list]
+    :param cron_schedule: Cron expression used to define the crawler schedule (e.g. cron(11 18 * ? * *))
+    :type cron_schedule: Optional[str]
+    :param classifiers: List of user-defined custom classifiers to be used by the crawler
+    :type classifiers: Optional[list]
+    :param table_prefix: Prefix for catalog tables to be created
+    :type table_prefix: Optional[str]
+    :param update_behavior: Behavior when the crawler identifies schema changes
+    :type update_behavior: Optional[str]
+    :param delete_behavior: Behavior when the crawler identifies deleted objects
+    :type delete_behavior: Optional[str]
+    :param recrawl_behavior: Behavior when the crawler needs to crawl again
+    :type recrawl_behavior: Optional[str]
+    :param lineage_settings: Enables or disables data lineage
+    :type lineage_settings: Optional[str]
+    :param json_configuration: Versioned JSON configuration for the crawler
+    :type json_configuration: Optional[str]
+    :param security_configuration: Name of the security configuration structure to be used by the crawler
+    :type security_configuration: Optional[str]
+    :param tags: Tags to attach to the crawler request
+    :type tags: Optional[dict]
+    """
+
+    CRAWLER_POLL_INTERVAL = 6  # polls crawler status after every CRAWLER_POLL_INTERVAL seconds
+
+    def __init__(
+        self,
+        crawler_name=None,
+        crawler_desc=None,
+        glue_db_name=None,
+        iam_role_name=None,
+        s3_targets_configuration=None,
+        jdbc_targets_configuration=None,
+        mongo_targets_configuration=None,
+        dynamo_targets_configuration=None,
+        glue_catalog_targets_configuration=None,
+        cron_schedule=None,
+        classifiers=None,
+        table_prefix=None,
+        update_behavior=None,
+        delete_behavior=None,
+        recrawl_behavior=None,
+        lineage_settings=None,
+        json_configuration=None,
+        security_configuration=None,
+        tags=None,
+        *args,
+        **kwargs,
+    ):
+
+        self.crawler_name = crawler_name
+        self.crawler_desc = crawler_desc
+        self.glue_db_name = glue_db_name
+        self.iam_role_name = iam_role_name
+        self.s3_targets_configuration = s3_targets_configuration
+        self.jdbc_targets_configuration = jdbc_targets_configuration
+        self.mongo_targets_configuration = mongo_targets_configuration
+        self.dynamo_targets_configuration = dynamo_targets_configuration
+        self.glue_catalog_targets_configuration = glue_catalog_targets_configuration
+        self.cron_schedule = cron_schedule
+        self.classifiers = classifiers
+        self.table_prefix = table_prefix
+        self.update_behavior = update_behavior
+        self.delete_behavior = delete_behavior
+        self.recrawl_behavior = recrawl_behavior
+        self.lineage_settings = lineage_settings
+        self.json_configuration = json_configuration
+        self.security_configuration = security_configuration
+        self.tags = tags
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    def list_crawlers(self) -> List:
+        ":return: Lists of Crawlers"
+        conn = self.get_conn()
+        return conn.get_crawlers()
+
+    def get_iam_execution_role(self) -> Dict:
+        ":return: iam role for crawler execution"
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        try:
+            glue_execution_role = iam_client.get_role(RoleName=self.iam_role_name)
+            self.log.info("Iam Role Name: %s", self.iam_role_name)
+            return glue_execution_role
+        except Exception as general_error:
+            self.log.error("Failed to create aws glue crawler, error: %s", general_error)
+            raise
+
+    def initialize_crawler(self):
+        """
+        Initializes connection with AWS Glue and starts the crawler
+
+        :return: response from the Glue start_crawler API call
+        """
+        glue_client = self.get_conn()
+
+        try:
+            crawler_name = self.get_or_create_glue_crawler()
+            crawler_run = glue_client.start_crawler(Name=crawler_name)
+            return crawler_run
+        except Exception as general_error:
+            self.log.error("Failed to run aws glue crawler, error: %s", general_error)
+            raise
+
+    def get_crawler_state(self, crawler_name: str) -> str:
+        """
+        Get state of the Glue crawler. The crawler state can be
+        ready, running, or stopping.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: State of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_state = crawler_run['Crawler']['State']
+        return crawler_run_state
+
+    def get_crawler_status(self, crawler_name: str) -> str:
+        """
+        Get current status of the Glue crawler. The crawler
+        status can be succeeded, cancelled, or failed.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Status of the Glue crawler
+        """
+        glue_client = self.get_conn()
+        crawler_run = glue_client.get_crawler(Name=crawler_name)
+        crawler_run_status = crawler_run['Crawler']['LastCrawl']['Status']
+        return crawler_run_status
+
+    def crawler_completion(self, crawler_name: str) -> Dict:
+        """
+        Waits until the Glue crawler with crawler_name completes or
+        fails and returns its final status if finished.
+        Raises AirflowException when the crawler fails.
+
+        :param crawler_name: unique crawler name per AWS account
+        :type crawler_name: str
+        :return: Dict containing the crawler's final status
+        """
+        failed_status = ['FAILED', 'CANCELLED']
+
+        while True:
+            crawler_run_state = self.get_crawler_state(crawler_name)
+            if crawler_run_state == 'READY':
+                self.log.info("Crawler: %s State: %s", crawler_name, crawler_run_state)
+                crawler_run_status = self.get_crawler_status(crawler_name)
+                if crawler_run_status in failed_status:
+                    crawler_error_message = (
+                        "Exiting Crawler: " + crawler_name + " Status: " + crawler_run_status
+                    )
+                    self.log.info(crawler_error_message)
+                    raise AirflowException(crawler_error_message)
+                else:
+                    self.log.info("Crawler Status: %s", crawler_run_status)
+                    metrics = self.get_crawler_metrics(self.crawler_name)
+                    self.log.info('Last Runtime Duration (seconds): %s', metrics['LastRuntimeSeconds'])
+                    self.log.info('Median Runtime Duration (seconds): %s', metrics['MedianRuntimeSeconds'])
+                    self.log.info('Tables Created: %s', metrics['TablesCreated'])
+                    self.log.info('Tables Updated: %s', metrics['TablesUpdated'])
+                    self.log.info('Tables Deleted: %s', metrics['TablesDeleted'])
+
+                    return {'Status': crawler_run_status}
+
+            else:
+                self.log.info(
+                    "Polling for AWS Glue crawler: %s Current run state: %s",
+                    self.crawler_name,
+                    crawler_run_state,
+                )
+                time.sleep(self.CRAWLER_POLL_INTERVAL)
+
+                metrics = self.get_crawler_metrics(self.crawler_name)
+                time_left = int(metrics['TimeLeftSeconds'])
+
+                if time_left > 0:
+                    self.log.info('Estimated Time Left (seconds): %s', time_left)
+                    self.CRAWLER_POLL_INTERVAL = time_left
+                else:
+                    self.log.info('Crawler should finish soon')
+
+    def get_or_create_glue_crawler(self) -> str:
+        """
+        Creates the crawler if it doesn't exist and returns the crawler name
+        :return:Name of the crawler

Review comment:
       ```suggestion
   
            :return: Name of the crawler
   ```
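   
   For reference, a sketch of the field-list docstring style used in the other provider hooks (the blank line before `:return:` is the important part; the method body is elided here):
   
   ```
   def get_or_create_glue_crawler(self) -> str:
       """
       Creates the crawler if it doesn't exist and returns the crawler name
   
       :return: Name of the crawler
       """
   ```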




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mschmo commented on a change in pull request #13072: AWS Glue Crawler Integration

Posted by GitBox <gi...@apache.org>.
mschmo commented on a change in pull request #13072:
URL: https://github.com/apache/airflow/pull/13072#discussion_r551636429



##########
File path: airflow/providers/amazon/aws/hooks/glue_crawler.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from time import sleep
+
+from cached_property import cached_property
+
+from airflow.exceptions import AirflowException
+from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
+
+
+class AwsGlueCrawlerHook(AwsBaseHook):
+    """
+    Interacts with AWS Glue Crawler
+
+    :param config: Configurations for the AWS Glue crawler
+    :type config: dict
+    """
+
+    def __init__(self, *args, **kwargs):
+        kwargs['client_type'] = 'glue'
+        super().__init__(*args, **kwargs)
+
+    @cached_property
+    def glue_client(self):
+        """:return: AWS Glue client"""
+        return self.get_conn()
+
+    def check_iam_role(self, role_name: str) -> None:
+        """
+        Checks if the input IAM role name is a valid,
+        pre-existing role within the caller's AWS account.
+        This is needed because the current Boto3 (<=1.16.46)
+        Glue client create_crawler() method misleadingly reports
+        a non-existent role as a role trust policy error.
+
+        :param role_name: IAM role name
+        :type role_name: str
+        """
+        iam_client = self.get_client_type('iam', self.region_name)
+
+        iam_client.get_role(RoleName=role_name)
+
+    def get_or_create_crawler(self, **crawler_kwargs) -> str:
+        """
+        Creates the crawler if it doesn't exist and returns the crawler name
+
+        :param crawler_kwargs: Keyword args that define the configurations used to create/update the crawler
+        :type crawler_kwargs: any
+        :return: Name of the crawler
+        """
+        crawler_name = crawler_kwargs['Name']
+        try:
+            glue_response = self.glue_client.get_crawler(Name=crawler_name)
+            self.log.info("Crawler %s already exists; updating crawler", crawler_name)
+            self.glue_client.update_crawler(**crawler_kwargs)

Review comment:
       Nice, I overlooked this update API method when implementing a custom Glue hook/operator in my own project. Good to know about!
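   
   For anyone else who missed it, a minimal standalone sketch of the call (the crawler name and role below are placeholders, not values from this PR):
   
   ```
   import boto3
   
   glue = boto3.client('glue')
   
   # update_crawler accepts the same keyword arguments as create_crawler,
   # so an existing crawler's configuration can be changed in place
   glue.update_crawler(Name='my_crawler', Role='my-glue-role')
   ```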




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org