Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/08/02 09:40:07 UTC

[GitHub] [airflow] MaksYermak opened a new pull request, #25466: Auto ML assets

MaksYermak opened a new pull request, #25466:
URL: https://github.com/apache/airflow/pull/25466

   I have created links and updated system tests for Auto ML operators. 
   
   Co-authored-by: Wojciech Januszek [januszek@google.com](mailto:januszek@google.com)
   Co-authored-by: Lukasz Wyszomirski [wyszomirski@google.com](mailto:wyszomirski@google.com)
   Co-authored-by: Maksim Yermakou [maksimy@google.com](mailto:maksimy@google.com)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] MaksYermak commented on a diff in pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
MaksYermak commented on code in PR #25466:
URL: https://github.com/apache/airflow/pull/25466#discussion_r952420087


##########
airflow/providers/google/cloud/links/automl.py:
##########
@@ -0,0 +1,161 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""This module contains Google AutoML links."""
+from typing import TYPE_CHECKING, Optional
+
+from airflow.providers.google.cloud.links.base import BaseGoogleLink
+
+if TYPE_CHECKING:
+    from airflow.utils.context import Context
+
+AUTOML_BASE_LINK = "https://console.cloud.google.com/automl-tables"
+AUTOML_DATASET_LINK = (
+    AUTOML_BASE_LINK + "/locations/{location}/datasets/{dataset_id}/schemav2?project={project_id}"
+)
+AUTOML_DATASET_LIST_LINK = AUTOML_BASE_LINK + "/datasets?project={project_id}"
+AUTOML_MODEL_LINK = (
+    AUTOML_BASE_LINK
+    + "/locations/{location}/datasets/{dataset_id};modelId={model_id}/evaluate?project={project_id}"
+)
+AUTOML_MODEL_TRAIN_LINK = (
+    AUTOML_BASE_LINK + "/locations/{location}/datasets/{dataset_id}/train?project={project_id}"
+)
+AUTOML_MODEL_PREDICT_LINK = (
+    AUTOML_BASE_LINK
+    + "/locations/{location}/datasets/{dataset_id};modelId={model_id}/predict?project={project_id}"
+)
+
+
+class AutoMLDatasetLink(BaseGoogleLink):
+    """Helper class for constructing AutoML Dataset link"""
+
+    name = "AutoML Dataset"
+    key = "automl_dataset"
+    format_str = AUTOML_DATASET_LINK
+
+    @staticmethod
+    def persist(
+        context: "Context",
+        task_instance,
+        dataset_id: str,
+        project_id: Optional[str],

Review Comment:
   @josh-fell I think it is a good idea. I have changed the code in the last commit. Now `project_id` is required for links.





[GitHub] [airflow] potiuk merged pull request #25466: Auto ML assets

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk merged PR #25466:
URL: https://github.com/apache/airflow/pull/25466




[GitHub] [airflow] josh-fell commented on a diff in pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
josh-fell commented on code in PR #25466:
URL: https://github.com/apache/airflow/pull/25466#discussion_r943033290


##########
airflow/providers/google/cloud/links/automl.py:
##########
@@ -0,0 +1,161 @@
[...same license header, imports, and URL constants as in the hunk quoted above; the review targets the lines below...]
+    @staticmethod
+    def persist(
+        context: "Context",
+        task_instance,
+        dataset_id: str,
+        project_id: Optional[str],

Review Comment:
   Thanks for the context! So thinking about this another way, it seems like mypy found an edge case where, given the typing of the input args, it's _possible_ to call the `persist()` methods without a value for `project_id`. Instead of trying to make mypy happy, it might be worthwhile to add a check ensuring a value is present to pass in as the `project_id` arg, and to have the typing of `project_id` in the operator link reflect that it _is_ required. WDYT?
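   A minimal standalone sketch of that suggestion, using a stand-in class rather than the real `BaseGoogleLink` and a hypothetical `xcom_push`-based `persist()` (names and payload shape are illustrative, not the PR's actual implementation):

   ```python
   from typing import Any

   AUTOML_DATASET_LINK = (
       "https://console.cloud.google.com/automl-tables"
       "/locations/{location}/datasets/{dataset_id}/schemav2?project={project_id}"
   )


   class AutoMLDatasetLink:
       """Stand-in for the operator link in the diff above."""

       key = "automl_dataset"
       format_str = AUTOML_DATASET_LINK

       @staticmethod
       def persist(task_instance: Any, location: str, dataset_id: str, project_id: str) -> None:
           # project_id is now typed as a plain str; the runtime guard makes a
           # missing value fail loudly instead of producing a broken console URL.
           if not project_id:
               raise ValueError("project_id is required to build an AutoML dataset link")
           task_instance.xcom_push(
               key=AutoMLDatasetLink.key,
               value={"location": location, "dataset_id": dataset_id, "project_id": project_id},
           )
   ```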





[GitHub] [airflow] josh-fell commented on a diff in pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
josh-fell commented on code in PR #25466:
URL: https://github.com/apache/airflow/pull/25466#discussion_r937957085


##########
airflow/providers/google/cloud/links/automl.py:
##########
@@ -0,0 +1,161 @@
[...same license header, imports, and URL constants as in the hunk quoted above; the review targets the lines below...]
+    @staticmethod
+    def persist(
+        context: "Context",
+        task_instance,
+        dataset_id: str,
+        project_id: Optional[str],

Review Comment:
   Looking at all of the URL formats, it seems like `project_id` is required. Is that true? If so, the typing should be updated to reflect this throughout.





[GitHub] [airflow] raphaelauv commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
raphaelauv commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1209948362

   there is a CSV file, tests/system/providers/google/cloud/automl/resources/bank-marketing.csv,
   of 45K chars. Is that normal?




[GitHub] [airflow] josh-fell commented on a diff in pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
josh-fell commented on code in PR #25466:
URL: https://github.com/apache/airflow/pull/25466#discussion_r937965805


##########
tests/system/providers/google/cloud/automl/example_automl_dataset.py:
##########
@@ -0,0 +1,198 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Example Airflow DAG for Google AutoML service testing dataset operations.
+"""
+import os
+from copy import deepcopy
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, List
+
+from airflow import models
+from airflow.providers.google.cloud.hooks.automl import CloudAutoMLHook
+from airflow.providers.google.cloud.operators.automl import (
+    AutoMLCreateDatasetOperator,
+    AutoMLDeleteDatasetOperator,
+    AutoMLImportDataOperator,
+    AutoMLListDatasetOperator,
+    AutoMLTablesListColumnSpecsOperator,
+    AutoMLTablesListTableSpecsOperator,
+    AutoMLTablesUpdateDatasetOperator,
+)
+from airflow.providers.google.cloud.operators.gcs import GCSCreateBucketOperator, GCSDeleteBucketOperator
+from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
+from airflow.utils.trigger_rule import TriggerRule
+
+ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
+DAG_ID = "automl_dataset"
+GCP_PROJECT_ID = os.environ.get("SYSTEM_TESTS_GCP_PROJECT", "default")
+
+GCP_AUTOML_LOCATION = "us-central1"
+
+DATA_SAMPLE_GCS_BUCKET_NAME = f"bucket_{DAG_ID}_{ENV_ID}"
+DATA_SAMPLE_GCS_OBJECT_NAME = "automl/bank-marketing/bank-marketing.csv"
+CSV_FILE_LOCAL_PATH = str(Path(__file__).parent / "resources" / "bank-marketing.csv")
+
+DATASET_NAME = f"dataset_{DAG_ID}_{ENV_ID}"
+DATASET = {
+    "display_name": DATASET_NAME,
+    "tables_dataset_metadata": {"target_column_spec_id": ""},
+}
+AUTOML_DATASET_BUCKET = f"gs://{DATA_SAMPLE_GCS_BUCKET_NAME}/{DATA_SAMPLE_GCS_OBJECT_NAME}"
+IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [AUTOML_DATASET_BUCKET]}}
+
+extract_object_id = CloudAutoMLHook.extract_object_id
+
+
+def get_target_column_spec(columns_specs: List[Dict], column_name: str) -> str:
+    """
+    Using column name returns spec of the column.
+    """
+    for column in columns_specs:
+        if column["display_name"] == column_name:
+            return extract_object_id(column)
+    raise Exception(f"Unknown target column: {column_name}")
+
+
+with models.DAG(
+    DAG_ID,
+    schedule_interval="@once",
+    start_date=datetime(2021, 1, 1),
+    catchup=False,
+    tags=["example", "automl"],
+    user_defined_macros={
+        "get_target_column_spec": get_target_column_spec,
+        "target": "Class",
+        "extract_object_id": extract_object_id,
+    },
+) as dag:
+    create_bucket = GCSCreateBucketOperator(
+        task_id="create_bucket",
+        bucket_name=DATA_SAMPLE_GCS_BUCKET_NAME,
+        storage_class="REGIONAL",
+        location=GCP_AUTOML_LOCATION,
+    )
+
+    upload_file = LocalFilesystemToGCSOperator(
+        task_id="upload_file_to_bucket",
+        src=CSV_FILE_LOCAL_PATH,
+        dst=DATA_SAMPLE_GCS_OBJECT_NAME,
+        bucket=DATA_SAMPLE_GCS_BUCKET_NAME,
+    )
+
+    # [START howto_operator_automl_create_dataset]
+    create_dataset = AutoMLCreateDatasetOperator(
+        task_id="create_dataset",
+        dataset=DATASET,
+        location=GCP_AUTOML_LOCATION,
+        project_id=GCP_PROJECT_ID,
+    )
+    dataset_id = create_dataset.output['dataset_id']
+    # [END howto_operator_automl_create_dataset]
+
+    # [START howto_operator_automl_import_data]
+    import_dataset_task = AutoMLImportDataOperator(
+        task_id="import_dataset_task",
+        dataset_id=dataset_id,
+        location=GCP_AUTOML_LOCATION,
+        input_config=IMPORT_INPUT_CONFIG,
+    )
+    # [END howto_operator_automl_import_data]
+
+    # [START howto_operator_automl_specs]
+    list_tables_spec_task = AutoMLTablesListTableSpecsOperator(
+        task_id="list_tables_spec_task",
+        dataset_id=dataset_id,
+        location=GCP_AUTOML_LOCATION,
+        project_id=GCP_PROJECT_ID,
+    )
+    # [END howto_operator_automl_specs]
+
+    # [START howto_operator_automl_column_specs]
+    list_columns_spec_task = AutoMLTablesListColumnSpecsOperator(
+        task_id="list_columns_spec_task",
+        dataset_id=dataset_id,
+        table_spec_id="{{ extract_object_id(task_instance.xcom_pull('list_tables_spec_task')[0]) }}",
+        location=GCP_AUTOML_LOCATION,
+        project_id=GCP_PROJECT_ID,
+    )
+    # [END howto_operator_automl_column_specs]
+
+    # [START howto_operator_automl_update_dataset]
+    update = deepcopy(DATASET)
+    update["name"] = '{{ task_instance.xcom_pull("create_dataset")["name"] }}'
+    update["tables_dataset_metadata"][  # type: ignore
+        "target_column_spec_id"
+    ] = "{{ get_target_column_spec(task_instance.xcom_pull('list_columns_spec_task'), target) }}"
+
+    update_dataset_task = AutoMLTablesUpdateDatasetOperator(
+        task_id="update_dataset_task",
+        dataset=update,
+        location=GCP_AUTOML_LOCATION,
+    )
+    # [END howto_operator_automl_update_dataset]
+
+    # [START howto_operator_list_dataset]
+    list_datasets_task = AutoMLListDatasetOperator(
+        task_id="list_datasets_task",
+        location=GCP_AUTOML_LOCATION,
+        project_id=GCP_PROJECT_ID,
+    )
+    # [END howto_operator_list_dataset]
+
+    # [START howto_operator_delete_dataset]
+    delete_dataset = AutoMLDeleteDatasetOperator(
+        task_id="delete_dataset",
+        dataset_id="{{ task_instance.xcom_pull('list_datasets_task', key='dataset_id_list') | list }}",

Review Comment:
   Does setting `render_template_as_native_obj=True` at the DAG level help here? Or would that negatively impact some of the other `XComs` being pulled?
   
   If setting that parameter works, you could simplify this arg to:
   `dataset_id=list_datasets_task.output['dataset_id_list']`
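   For context, the difference that flag makes can be sketched without Airflow at all: by default every templated field renders to a string, so a `| list` filter yields the string repr of a list, which downstream code would have to parse back; with native rendering the field arrives as a real list. (The values below are illustrative, not taken from the PR.)

   ```python
   import ast

   # Default Jinja rendering: a templated field like dataset_id arrives as a
   # string, e.g. the repr of a list rather than the list itself.
   rendered_default = "['dataset_1', 'dataset_2']"

   # With render_template_as_native_obj=True, Airflow uses Jinja's native
   # environment and the same template would yield a real Python list.
   # Parsing the string repr back is one way to mimic that here.
   rendered_native = ast.literal_eval(rendered_default)

   print(type(rendered_default).__name__)  # str
   print(type(rendered_native).__name__)   # list
   ```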





[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1241418808

   Sorry for the delay - been a bit busy.
   
   No, it's not compressed - it's just bundled in a .tar now, not zipped (.tar-ing a single file kind of makes no sense). Still takes 170K instead of 20K (and this PR needs a rebase anyway).




[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1289733383

   Conflicts need to be resolved after string normalisation.




[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1386729025

   Rebased - static checks were fixed in main (a MySQL Python connector release was breaking mypy).




[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1210563449

   Yeah, a 45K-line .csv file is NOT something we want. A few options:
   
   1) What happens when you zip the file? How big is it going to get?
   2) Do we REALLY need this big of a file?
   3) We could easily place it in our Amazon S3 bucket and download it for the test when needed; we could make it publicly available.
   
   




[GitHub] [airflow] MaksYermak commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
MaksYermak commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1233870724

   > Can we compress it (and dynamically decompress during the test)? Just zipping it is 20K instead of 160K. This file is unlikely to ever change and it is completely uninteresting to see what's in it when you review the code, so there is no particular reason to keep the text file in Git.
   > 
   > It's not only the size that matters in this case. Keeping it plain text has the really nasty effect that when you search for something in the source code in your IDE, you will likely find some matching words here, so keeping the file uncompressed makes it very prone to falling victim to search & replace.
   
   @potiuk I have done it.




[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1203144638

   Errors :(




[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1205401348

   Rebased to account for the Flask 2.2 errors fixed yesterday.




[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1385384593

   static check failures.




[GitHub] [airflow] MrGeorgeOwl commented on pull request #25466: Auto ML assets

Posted by "MrGeorgeOwl (via GitHub)" <gi...@apache.org>.
MrGeorgeOwl commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1400125528

   @potiuk I think this PR can be merged. I can't do that myself because I am not the author of the PR and I don't have write access.




[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1227221240

   Can we compress it (and dynamically decompress during the test)? Just zipping it is 20K instead of 160K. This file is unlikely to ever change and it is completely uninteresting to see what's in it when you review the code, so there is no particular reason to keep the text file in Git.
   
   It's not only the size that matters in this case. Keeping it plain text has the really nasty effect that when you search for something in the source code in your IDE, you will likely find some matching words here, so keeping the file uncompressed makes it very prone to falling victim to search & replace.
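   A sketch of that "keep the fixture zipped, unzip on demand" idea for the system test, with illustrative file and archive names (the actual layout in the PR may differ):

   ```python
   import zipfile
   from pathlib import Path


   def extract_csv(archive: Path, member: str) -> Path:
       """Unzip one member of a fixture archive next to it and return its path."""
       with zipfile.ZipFile(archive) as zf:
           zf.extract(member, path=archive.parent)
       return archive.parent / member


   # Illustrative usage: only the small .zip lives in the repo; the CSV is
   # materialised when the system test actually runs.
   # csv_path = extract_csv(Path("resources/bank-marketing.zip"), "bank-marketing.csv")
   ```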




[GitHub] [airflow] bhirsz commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
bhirsz commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1227073499

   > > Yeah, a 45K-line .csv file is NOT something we want. A few options:
   > > 
   > > 1. What happens when you zip the file? How big is it going to get?
   > > 2. Do we REALLY need this big of a file?
   > > 3. We could easily place it in our Amazon S3 bucket and download it for the test when needed; we could make it publicly available.
   > 
   > This .csv is needed for training an AutoML model; to start the training, the .csv should contain more than 1000 rows. For our test I can reduce the file to 2100 rows. @potiuk what do you think about reducing the file size?
   
   @potiuk Catching your attention :) I think 2100 is okayish (not the best, but certainly better than 50k). Please comment if you still think it should be stored in external storage.




[GitHub] [airflow] MaksYermak commented on a diff in pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
MaksYermak commented on code in PR #25466:
URL: https://github.com/apache/airflow/pull/25466#discussion_r941426714


##########
tests/system/providers/google/cloud/automl/example_automl_dataset.py:
##########
@@ -0,0 +1,198 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Example Airflow DAG for Google AutoML service testing dataset operations.
+"""
+import os
+from copy import deepcopy
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, List
+
+from airflow import models
+from airflow.providers.google.cloud.hooks.automl import CloudAutoMLHook
+from airflow.providers.google.cloud.operators.automl import (
+    AutoMLCreateDatasetOperator,
+    AutoMLDeleteDatasetOperator,
+    AutoMLImportDataOperator,
+    AutoMLListDatasetOperator,
+    AutoMLTablesListColumnSpecsOperator,
+    AutoMLTablesListTableSpecsOperator,
+    AutoMLTablesUpdateDatasetOperator,
+)
+from airflow.providers.google.cloud.operators.gcs import GCSCreateBucketOperator, GCSDeleteBucketOperator
+from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
+from airflow.utils.trigger_rule import TriggerRule
+
+ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
+DAG_ID = "automl_dataset"
+GCP_PROJECT_ID = os.environ.get("SYSTEM_TESTS_GCP_PROJECT", "default")
+
+GCP_AUTOML_LOCATION = "us-central1"
+
+DATA_SAMPLE_GCS_BUCKET_NAME = f"bucket_{DAG_ID}_{ENV_ID}"
+DATA_SAMPLE_GCS_OBJECT_NAME = "automl/bank-marketing/bank-marketing.csv"
+CSV_FILE_LOCAL_PATH = str(Path(__file__).parent / "resources" / "bank-marketing.csv")
+
+DATASET_NAME = f"dataset_{DAG_ID}_{ENV_ID}"
+DATASET = {
+    "display_name": DATASET_NAME,
+    "tables_dataset_metadata": {"target_column_spec_id": ""},
+}
+AUTOML_DATASET_BUCKET = f"gs://{DATA_SAMPLE_GCS_BUCKET_NAME}/{DATA_SAMPLE_GCS_OBJECT_NAME}"
+IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [AUTOML_DATASET_BUCKET]}}
+
+extract_object_id = CloudAutoMLHook.extract_object_id
+
+
+def get_target_column_spec(columns_specs: List[Dict], column_name: str) -> str:
+    """
+    Using column name returns spec of the column.
+    """
+    for column in columns_specs:
+        if column["display_name"] == column_name:
+            return extract_object_id(column)
+    raise Exception(f"Unknown target column: {column_name}")
+
+
+with models.DAG(
+    DAG_ID,
+    schedule_interval="@once",
+    start_date=datetime(2021, 1, 1),
+    catchup=False,
+    tags=["example", "automl"],
+    user_defined_macros={
+        "get_target_column_spec": get_target_column_spec,
+        "target": "Class",
+        "extract_object_id": extract_object_id,
+    },
+) as dag:
+    create_bucket = GCSCreateBucketOperator(
+        task_id="create_bucket",
+        bucket_name=DATA_SAMPLE_GCS_BUCKET_NAME,
+        storage_class="REGIONAL",
+        location=GCP_AUTOML_LOCATION,
+    )
+
+    upload_file = LocalFilesystemToGCSOperator(
+        task_id="upload_file_to_bucket",
+        src=CSV_FILE_LOCAL_PATH,
+        dst=DATA_SAMPLE_GCS_OBJECT_NAME,
+        bucket=DATA_SAMPLE_GCS_BUCKET_NAME,
+    )
+
+    # [START howto_operator_automl_create_dataset]
+    create_dataset = AutoMLCreateDatasetOperator(
+        task_id="create_dataset",
+        dataset=DATASET,
+        location=GCP_AUTOML_LOCATION,
+        project_id=GCP_PROJECT_ID,
+    )
+    dataset_id = create_dataset.output["dataset_id"]
+    # [END howto_operator_automl_create_dataset]
+
+    # [START howto_operator_automl_import_data]
+    import_dataset_task = AutoMLImportDataOperator(
+        task_id="import_dataset_task",
+        dataset_id=dataset_id,
+        location=GCP_AUTOML_LOCATION,
+        input_config=IMPORT_INPUT_CONFIG,
+    )
+    # [END howto_operator_automl_import_data]
+
+    # [START howto_operator_automl_specs]
+    list_tables_spec_task = AutoMLTablesListTableSpecsOperator(
+        task_id="list_tables_spec_task",
+        dataset_id=dataset_id,
+        location=GCP_AUTOML_LOCATION,
+        project_id=GCP_PROJECT_ID,
+    )
+    # [END howto_operator_automl_specs]
+
+    # [START howto_operator_automl_column_specs]
+    list_columns_spec_task = AutoMLTablesListColumnSpecsOperator(
+        task_id="list_columns_spec_task",
+        dataset_id=dataset_id,
+        table_spec_id="{{ extract_object_id(task_instance.xcom_pull('list_tables_spec_task')[0]) }}",
+        location=GCP_AUTOML_LOCATION,
+        project_id=GCP_PROJECT_ID,
+    )
+    # [END howto_operator_automl_column_specs]
+
+    # [START howto_operator_automl_update_dataset]
+    update = deepcopy(DATASET)
+    update["name"] = '{{ task_instance.xcom_pull("create_dataset")["name"] }}'
+    update["tables_dataset_metadata"][  # type: ignore
+        "target_column_spec_id"
+    ] = "{{ get_target_column_spec(task_instance.xcom_pull('list_columns_spec_task'), target) }}"
+
+    update_dataset_task = AutoMLTablesUpdateDatasetOperator(
+        task_id="update_dataset_task",
+        dataset=update,
+        location=GCP_AUTOML_LOCATION,
+    )
+    # [END howto_operator_automl_update_dataset]
+
+    # [START howto_operator_list_dataset]
+    list_datasets_task = AutoMLListDatasetOperator(
+        task_id="list_datasets_task",
+        location=GCP_AUTOML_LOCATION,
+        project_id=GCP_PROJECT_ID,
+    )
+    # [END howto_operator_list_dataset]
+
+    # [START howto_operator_delete_dataset]
+    delete_dataset = AutoMLDeleteDatasetOperator(
+        task_id="delete_dataset",
+        dataset_id="{{ task_instance.xcom_pull('list_datasets_task', key='dataset_id_list') | list }}",

Review Comment:
   @josh-fell I have checked your suggestion. When I add this setting, some `XComs` break.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] MaksYermak commented on a diff in pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
MaksYermak commented on code in PR #25466:
URL: https://github.com/apache/airflow/pull/25466#discussion_r942217892


##########
airflow/providers/google/cloud/links/automl.py:
##########
@@ -0,0 +1,161 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""This module contains Google AutoML links."""
+from typing import TYPE_CHECKING, Optional
+
+from airflow.providers.google.cloud.links.base import BaseGoogleLink
+
+if TYPE_CHECKING:
+    from airflow.utils.context import Context
+
+AUTOML_BASE_LINK = "https://console.cloud.google.com/automl-tables"
+AUTOML_DATASET_LINK = (
+    AUTOML_BASE_LINK + "/locations/{location}/datasets/{dataset_id}/schemav2?project={project_id}"
+)
+AUTOML_DATASET_LIST_LINK = AUTOML_BASE_LINK + "/datasets?project={project_id}"
+AUTOML_MODEL_LINK = (
+    AUTOML_BASE_LINK
+    + "/locations/{location}/datasets/{dataset_id};modelId={model_id}/evaluate?project={project_id}"
+)
+AUTOML_MODEL_TRAIN_LINK = (
+    AUTOML_BASE_LINK + "/locations/{location}/datasets/{dataset_id}/train?project={project_id}"
+)
+AUTOML_MODEL_PREDICT_LINK = (
+    AUTOML_BASE_LINK
+    + "/locations/{location}/datasets/{dataset_id};modelId={model_id}/predict?project={project_id}"
+)
+
+
+class AutoMLDatasetLink(BaseGoogleLink):
+    """Helper class for constructing AutoML Dataset link"""
+
+    name = "AutoML Dataset"
+    key = "automl_dataset"
+    format_str = AUTOML_DATASET_LINK
+
+    @staticmethod
+    def persist(
+        context: "Context",
+        task_instance,
+        dataset_id: str,
+        project_id: Optional[str],

Review Comment:
   @josh-fell you are right, this parameter is required for links, but for the older Google operators `project_id` is optional. We work around that with `self.project_id or hook.project_id`, yet both of those are optional as well, so when I made `project_id` required in the links, our static checks failed. That is why this parameter is Optional.
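
The fallback described above can be sketched in plain Python. Note that `FakeHook` and `resolve_project_id` are hypothetical stand-ins for illustration, not the actual Airflow hook or link API:

```python
from typing import Optional


class FakeHook:
    """Hypothetical stand-in for CloudAutoMLHook (illustration only)."""

    def __init__(self, project_id: Optional[str] = None) -> None:
        self.project_id = project_id


def resolve_project_id(operator_project_id: Optional[str], hook: FakeHook) -> Optional[str]:
    # The operator's explicit project_id wins; otherwise fall back to the
    # hook's. Both may be None, which is why the link helper has to keep
    # accepting Optional[str] despite the console URL requiring a project.
    return operator_project_id or hook.project_id


print(resolve_project_id("explicit", FakeHook("from-hook")))  # -> explicit
print(resolve_project_id(None, FakeHook("from-hook")))        # -> from-hook
```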





[GitHub] [airflow] MaksYermak commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
MaksYermak commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1211959617

   > Yeah 45K lines of .csv file is NOT something we want. Few options:
   > 
   > 1. what happens when you zip the file ? how big it is going to get
   > 2. Do we REALLY need as big of a file?
   > 3. We could easily place it in our Amazon S3 bucket to download it for the test when needed; we could make it publicly available
   
   This .csv is needed for training an AutoML model; to start the training, the .csv must contain more than 1000 rows. For our test I can reduce the file to 2100 rows. @potiuk what do you think about reducing the file size?
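
A rough sketch of the proposed reduction (the file names match this PR, but the script below builds a synthetic stand-in for `bank-marketing.csv` so it is runnable anywhere, and also checks how much option 1 from the quoted comment, compression, would buy):

```python
import gzip
import os
import shutil

# Synthetic stand-in for the ~45K-line bank-marketing.csv (hypothetical columns).
with open("bank-marketing.csv", "w") as f:
    f.write("age,job,target\n")
    for _ in range(45000):
        f.write("30,services,no\n")

# Keep the header plus the first 2100 data rows, as proposed above.
with open("bank-marketing.csv") as src, open("bank-marketing-small.csv", "w") as dst:
    for i, line in enumerate(src):
        if i >= 2101:
            break
        dst.write(line)

# Compress the reduced file to see how small the zipped version gets.
with open("bank-marketing-small.csv", "rb") as src:
    with gzip.open("bank-marketing-small.csv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

print(os.path.getsize("bank-marketing-small.csv"), "->",
      os.path.getsize("bank-marketing-small.csv.gz"))
```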




[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1336540703

   Tests failing.




[GitHub] [airflow] potiuk commented on pull request #25466: Auto ML assets

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25466:
URL: https://github.com/apache/airflow/pull/25466#issuecomment-1296513193

   Rebased to rebuild.

