Posted to commits@airflow.apache.org by "lzdanski (via GitHub)" <gi...@apache.org> on 2024/04/02 18:17:24 UTC

[PR] Data aware scheduling docs edits [airflow]

lzdanski opened a new pull request, #38687:
URL: https://github.com/apache/airflow/pull/38687

   Copy edits for 2.9 Data Aware Scheduling feature docs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Data aware scheduling docs edits [airflow]

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on code in PR #38687:
URL: https://github.com/apache/airflow/pull/38687#discussion_r1548778668


##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -51,38 +51,38 @@ In addition to scheduling DAGs based upon time, they can also be scheduled based
 What is a "dataset"?
 --------------------
 
-An Airflow dataset is a stand-in for a logical grouping of data. Datasets may be updated by upstream "producer" tasks, and dataset updates contribute to scheduling downstream "consumer" DAGs.
+An Airflow Dataset is a logical grouping of data. Upstream producer tasks can update datasets, and dataset updates contribute to scheduling downstream consumer DAGs.

Review Comment:
   ```suggestion
   An Airflow dataset is a logical grouping of data. Upstream producer tasks can update datasets, and dataset updates contribute to scheduling downstream consumer DAGs.
   ```
   
   Capitalization should be uniform. (It could be capitalized everywhere instead, but every other occurrence is lowercase.)





Re: [PR] Data aware scheduling docs edits [airflow]

Posted by "lzdanski (via GitHub)" <gi...@apache.org>.
lzdanski commented on PR #38687:
URL: https://github.com/apache/airflow/pull/38687#issuecomment-2032743427

   @sunank200 - Docs copyediting for https://github.com/apache/airflow/pull/37101. I took a look at the PRs you shared with me, https://github.com/apache/airflow/pull/37771 and https://github.com/apache/airflow/pull/38576, and I think copyedits for this content might be covered by https://github.com/apache/airflow/pull/38505.




Re: [PR] Data aware scheduling docs edits [airflow]

Posted by "lzdanski (via GitHub)" <gi...@apache.org>.
lzdanski commented on code in PR #38687:
URL: https://github.com/apache/airflow/pull/38687#discussion_r1552304060


##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -51,38 +51,38 @@ In addition to scheduling DAGs based upon time, they can also be scheduled based
 What is a "dataset"?
 --------------------
 
-An Airflow dataset is a stand-in for a logical grouping of data. Datasets may be updated by upstream "producer" tasks, and dataset updates contribute to scheduling downstream "consumer" DAGs.
+An Airflow Dataset is a logical grouping of data. Upstream producer tasks can update datasets, and dataset updates contribute to scheduling downstream consumer DAGs.

Review Comment:
   Switched to lowercase. We capitalize it only when it's used in code examples, which seemed like a good way to make a quick distinction.





Re: [PR] Data aware scheduling docs edits [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk merged PR #38687:
URL: https://github.com/apache/airflow/pull/38687




Re: [PR] Data aware scheduling docs edits [airflow]

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on code in PR #38687:
URL: https://github.com/apache/airflow/pull/38687#discussion_r1548779735


##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -51,38 +51,38 @@ In addition to scheduling DAGs based upon time, they can also be scheduled based
 What is a "dataset"?
 --------------------
 
-An Airflow dataset is a stand-in for a logical grouping of data. Datasets may be updated by upstream "producer" tasks, and dataset updates contribute to scheduling downstream "consumer" DAGs.
+An Airflow Dataset is a logical grouping of data. Upstream producer tasks can update datasets, and dataset updates contribute to scheduling downstream consumer DAGs.
 
-A dataset is defined by a Uniform Resource Identifier (URI):
+Uniform Resource Identifier (URI) define datasets:
 
 .. code-block:: python
 
     from airflow.datasets import Dataset
 
     example_dataset = Dataset("s3://dataset-bucket/example.csv")
 
-Airflow makes no assumptions about the content or location of the data represented by the URI. It is treated as a string, so any use of regular expressions (eg ``input_\d+.csv``) or file glob patterns (eg ``input_2022*.csv``) as an attempt to create multiple datasets from one declaration will not work.
+Airflow makes no assumptions about the content or location of the data represented by the URI, and treats the URI like a string. This means that Airflow treats any regular expressions, like ``input_\d+.csv``, or file glob patterns, such as ``input_2022*.csv``, as an attempt to create multiple datasets from one declaration, and they will not work.
 
-A dataset should be created with a valid URI. Airflow core and providers define various URI schemes that you can use, such as ``file`` (core), ``postgres`` (by the Postgres provider), and ``s3`` (by the Amazon provider). Third-party providers and plugins may also provide their own schemes. These pre-defined schemes have individual semantics that are expected to be followed.
+You must create datasets with a valid URI. Airflow core and providers define various URI schemes that you can use, such as ``file`` (core), ``postgres`` (by the Postgres provider), and ``s3`` (by the Amazon provider). Third-party providers and plugins might also provide their own schemes. These pre-defined schemes have individual semantics that are expected to be followed.
 
 What is valid URI?
 ------------------
 
-Technically, the URI must conform to the valid character set in RFC 3986. If you don't know what this means, that's basically ASCII alphanumeric characters, plus ``%``,  ``-``, ``_``, ``.``, and ``~``. To identify a resource that cannot be represented by URI-safe characters, encode the resource name with `percent-encoding <https://en.wikipedia.org/wiki/Percent-encoding>`_.
+Technically, the URI must conform to the valid character set in RFC 3986, which is basically ASCII alphanumeric characters, plus ``%``,  ``-``, ``_``, ``.``, and ``~``. To identify a resource that cannot be represented by URI-safe characters, encode the resource name with `percent-encoding <https://en.wikipedia.org/wiki/Percent-encoding>`_.

Review Comment:
   We should probably add a link to the Wikipedia entry on URI somewhere too. https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
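
As an illustration of the percent-encoding mentioned in the hunk above, here is a minimal sketch using only the Python standard library. It is not part of the PR; the file name and the `dataset-bucket` path are hypothetical, and whether a given scheme accepts the result still depends on that scheme's own semantics.

```python
from urllib.parse import quote, unquote

# Hypothetical resource name with characters outside the URI-safe set
# (spaces and parentheses).
raw_name = "monthly report (final).csv"

# quote() leaves ASCII letters, digits, and "_", ".", "-", "~" untouched and
# percent-encodes everything else. safe="" also encodes "/", which quote()
# would otherwise leave as-is.
encoded_name = quote(raw_name, safe="")

# Hypothetical bucket, for illustration only. The resulting string could then
# be passed to Dataset(...), as in the code block quoted in the diff.
dataset_uri = f"s3://dataset-bucket/{encoded_name}"

print(dataset_uri)
# unquote() reverses the encoding and recovers the original name.
print(unquote(encoded_name))
```

Encoding only the resource name, rather than the whole URI, keeps the scheme separator and path slashes intact.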





Re: [PR] Data aware scheduling docs edits [airflow]

Posted by "lzdanski (via GitHub)" <gi...@apache.org>.
lzdanski commented on PR #38687:
URL: https://github.com/apache/airflow/pull/38687#issuecomment-2038233698

   Wait on review from @sunank200 before merging!




Re: [PR] Data aware scheduling docs edits [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #38687:
URL: https://github.com/apache/airflow/pull/38687#discussion_r1553922371


##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -23,7 +23,7 @@ Data-aware scheduling
 Quickstart
 ----------
 
-In addition to scheduling DAGs based upon time, they can also be scheduled based upon a task updating a dataset.
+In addition to scheduling DAGs based on time, you can also schedule DAGs to run based on when a task updates a dataset.

Review Comment:
   Since this is waiting for @sunank200, I'm commenting here to prevent an accidental merge :)





Re: [PR] Data aware scheduling docs edits [airflow]

Posted by "sunank200 (via GitHub)" <gi...@apache.org>.
sunank200 commented on code in PR #38687:
URL: https://github.com/apache/airflow/pull/38687#discussion_r1562547462


##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########
@@ -23,7 +23,7 @@ Data-aware scheduling
 Quickstart
 ----------
 
-In addition to scheduling DAGs based upon time, they can also be scheduled based upon a task updating a dataset.
+In addition to scheduling DAGs based on time, you can also schedule DAGs to run based on when a task updates a dataset.

Review Comment:
   LGTM


