Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/22 10:37:11 UTC

[GitHub] [airflow] alexott opened a new pull request #22422: More operators for Databricks Repos

alexott opened a new pull request #22422:
URL: https://github.com/apache/airflow/pull/22422


   This PR adds two additional operators for Databricks Repos:
   * `DatabricksReposCreateOperator` - to create a new Databricks Repo
   * `DatabricksReposDeleteOperator` - to delete a Databricks Repo
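
   For context, both operators are thin wrappers over the Databricks Repos REST API. As a rough sketch (the `build_create_repo_payload` helper is hypothetical and not part of this PR; the field names follow the linked create-repo API docs), the create operator essentially submits a body like:

   ```python
   from typing import Optional


   def build_create_repo_payload(
       git_url: str, git_provider: str, repo_path: Optional[str] = None
   ) -> dict:
       """Request body for POST /api/2.0/repos (field names per the Repos API docs)."""
       payload = {"url": git_url, "provider": git_provider}
       if repo_path is not None:
           # Omitting "path" lets Databricks create the repo in the user's directory.
           payload["path"] = repo_path
       return payload


   print(build_create_repo_payload(
       "https://github.com/apache/airflow.git",
       "gitHub",
       repo_path="/Repos/user@example.com/airflow",
   ))
   ```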


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] alexott commented on a change in pull request #22422: More operators for Databricks Repos

Posted by GitBox <gi...@apache.org>.
alexott commented on a change in pull request #22422:
URL: https://github.com/apache/airflow/pull/22422#discussion_r834082159



##########
File path: airflow/providers/databricks/operators/databricks_repos.py
##########
@@ -28,12 +29,131 @@
     from airflow.utils.context import Context
 
 
+class DatabricksReposCreateOperator(BaseOperator):
+    """
+    Creates a Databricks Repo using the
+    `POST api/2.0/repos <https://docs.databricks.com/dev-tools/api/latest/repos.html#operation/create-repo>`_
+    API endpoint, optionally checking it out to a specific branch or tag.
+
+    :param git_url: Required HTTPS URL of a Git repository
+    :param git_provider: Optional name of the Git provider. Must be provided if it can't be guessed from the URL.
+    :param repo_path: optional path for the repository. Must be in the format ``/Repos/{folder}/{repo-name}``.
+        If not specified, the repo will be created in the user's directory.
+    :param branch: optional name of a branch to check out.
+    :param tag: optional name of a tag to check out.
+    :param databricks_conn_id: Reference to the :ref:`Databricks connection <howto/connection:databricks>`.
+        By default and in the common case this will be ``databricks_default``. To use
+        token based authentication, provide the key ``token`` in the extra field for the
+        connection, create the extra key ``host`` there as well, and leave the connection's ``host`` field empty.
+    :param databricks_retry_limit: Number of times to retry if the Databricks backend is
+        unreachable. Its value must be greater than or equal to 1.
+    :param databricks_retry_delay: Number of seconds to wait between retries (it
+        might be a floating point number).
+    """
+
+    # Used in airflow.models.BaseOperator
+    template_fields: Sequence[str] = ('repo_path', 'git_url', 'tag', 'branch')
+
+    __git_providers__ = {
+        "github.com": "gitHub",
+        "dev.azure.com": "azureDevOpsServices",
+        "gitlab.com": "gitLab",
+        "bitbucket.org": "bitbucketCloud",
+    }
+    __aws_code_commit_regexp__ = re.compile(r"^git-codecommit\.[^.]+\.amazonaws\.com$")
+    __repos_path_regexp__ = re.compile(r"/Repos/[^/]+/[^/]+/?$")
+
+    def __init__(
+        self,
+        *,
+        git_url: str,
+        git_provider: Optional[str] = None,
+        branch: Optional[str] = None,
+        tag: Optional[str] = None,
+        repo_path: Optional[str] = None,
+        databricks_conn_id: str = 'databricks_default',
+        databricks_retry_limit: int = 3,
+        databricks_retry_delay: int = 1,
+        **kwargs,
+    ) -> None:
+        """Creates a new ``DatabricksReposCreateOperator``."""
+        super().__init__(**kwargs)
+        self.databricks_conn_id = databricks_conn_id
+        self.databricks_retry_limit = databricks_retry_limit
+        self.databricks_retry_delay = databricks_retry_delay
+        self.git_url = git_url
+        if git_provider is None:
+            self.git_provider = self.__detect_repo_provider__(git_url)
+            if self.git_provider is None:
+                raise AirflowException(
+                    f"git_provider isn't specified and couldn't be guessed for URL {git_url}"
+                )
+        else:
+            self.git_provider = git_provider
+        if repo_path is not None and not self.__repos_path_regexp__.match(repo_path):
+            raise AirflowException(
+                f"repo_path should have form of /Repos/{{folder}}/{{repo-name}}, got '{repo_path}'"
+            )

Review comment:
       thank you - I'll fix it later today.







[GitHub] [airflow] mik-laj commented on pull request #22422: More operators for Databricks Repos

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #22422:
URL: https://github.com/apache/airflow/pull/22422#issuecomment-1075049965


   Can this functionality be done without adding a new operator? I am afraid of adding more operators when we do not even have plans for how we want to maintain them and test whether they work properly. Committers check that the change looks good. Sometimes they will look at the documentation and see if it matches what is implemented, but it may still not work properly. What can we do to improve the operational readiness of this provider, since we have recently had more changes in it?
   
   For Snowflake, we decided that we only have one operator, so it's easy to check that it's working properly.
   For Google/AWS, we have prepared system tests and we want to run them on CI.
   
   What would we like to do for Databricks? Do you have any plans?





[GitHub] [airflow] josh-fell commented on a change in pull request #22422: More operators for Databricks Repos

Posted by GitBox <gi...@apache.org>.
josh-fell commented on a change in pull request #22422:
URL: https://github.com/apache/airflow/pull/22422#discussion_r833358353



##########
File path: airflow/providers/databricks/operators/databricks_repos.py
##########

Review comment:
       Since `repo_path` is a templated field, there is a chance that the arg passed in is a Jinja expression or an `XComArg`, in which case the regexp check could fail with a false negative, because templated values aren't evaluated until the task begins to execute.
   
   Can you move this check into the `execute()` method scope?
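   The false negative is easy to reproduce with plain `re`, using the same pattern as `__repos_path_regexp__` in the PR's diff (the variable names here are illustrative):

   ```python
   import re

   # Same pattern as __repos_path_regexp__ in the PR's diff.
   repos_path_regexp = re.compile(r"/Repos/[^/]+/[^/]+/?$")

   rendered = "/Repos/user@example.com/my-repo"        # what the template renders to
   templated = "{{ var.value.databricks_repo_path }}"  # what __init__ actually sees

   print(bool(repos_path_regexp.match(rendered)))   # True
   print(bool(repos_path_regexp.match(templated)))  # False: a valid value is rejected before rendering
   ```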

##########
File path: airflow/providers/databricks/operators/databricks_repos.py
##########

Review comment:
       Same comment here since `git_url` is also a templated field. This _might_ not return the expected result, or it might fail unexpectedly, when trying to parse a templated string before it is rendered.
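   To illustrate, hostname-based provider detection (sketched here with stdlib `urlparse`; `detect_repo_provider` is a hypothetical re-implementation of the diff's detection table, not the PR's code) returns nothing useful for an unrendered template string:

   ```python
   import re
   from urllib.parse import urlparse

   # Detection table and AWS CodeCommit pattern mirroring the PR's diff.
   GIT_PROVIDERS = {
       "github.com": "gitHub",
       "dev.azure.com": "azureDevOpsServices",
       "gitlab.com": "gitLab",
       "bitbucket.org": "bitbucketCloud",
   }
   AWS_CODE_COMMIT = re.compile(r"^git-codecommit\.[^.]+\.amazonaws\.com$")

   def detect_repo_provider(git_url: str):
       """Guess the Git provider from the URL's hostname; None if unknown."""
       host = urlparse(git_url).hostname
       if host is None:
           return None
       if host in GIT_PROVIDERS:
           return GIT_PROVIDERS[host]
       if AWS_CODE_COMMIT.match(host):
           return "awsCodeCommit"
       return None

   print(detect_repo_provider("https://github.com/apache/airflow.git"))  # gitHub
   print(detect_repo_provider("{{ params.git_url }}"))  # None - template not yet rendered
   ```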

##########
File path: docs/apache-airflow-providers-databricks/connections/databricks.rst
##########
@@ -64,9 +64,9 @@ Password (optional)
 Extra (optional)
     Specify the extra parameter (as json dictionary) that can be used in the Databricks connection.
 
-    Following parameter should be used if using the *PAT* authentication method:
+    Following parameter could be used if using the *PAT* authentication method:

Review comment:
       ```suggestion
       The following parameter could be used if using the *PAT* authentication method:
    ```







[GitHub] [airflow] potiuk merged pull request #22422: More operators for Databricks Repos

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #22422:
URL: https://github.com/apache/airflow/pull/22422


   





[GitHub] [airflow] potiuk commented on a change in pull request #22422: More operators for Databricks Repos

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #22422:
URL: https://github.com/apache/airflow/pull/22422#discussion_r834045820



##########
File path: airflow/providers/databricks/operators/databricks_repos.py
##########

Review comment:
       good point @josh-fell !







[GitHub] [airflow] alexott commented on a change in pull request #22422: More operators for Databricks Repos

Posted by GitBox <gi...@apache.org>.
alexott commented on a change in pull request #22422:
URL: https://github.com/apache/airflow/pull/22422#discussion_r834566960



##########
File path: airflow/providers/databricks/operators/databricks_repos.py
##########

Review comment:
       @potiuk @josh-fell I've addressed review comments...







[GitHub] [airflow] alexott commented on pull request #22422: More operators for Databricks Repos

Posted by GitBox <gi...@apache.org>.
alexott commented on pull request #22422:
URL: https://github.com/apache/airflow/pull/22422#issuecomment-1075071687


   @mik-laj These operators cover specific functionality that is only available via a specific REST API, so the alternative would be to drop down to low-level API calls. It's different from the Snowflake connector, where you do everything via SQL - for Databricks we have the similar `DatabricksSqlOperator`...
   
   We're planning to build a system test suite that will be used to test all operators.





[GitHub] [airflow] potiuk commented on pull request #22422: More operators for Databricks Repos

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22422:
URL: https://github.com/apache/airflow/pull/22422#issuecomment-1075086029


   > We're planning to build a system test suite that will be used to test all operators.
   
   Yep. We are just about to merge the first change from AIP-47 https://github.com/apache/airflow/pull/22311 and we will literally prepare (automatically) issues to convert all the providers. I will ask everyone involved to comment and will assign them to those issues.

