You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/14 02:01:21 UTC

[GitHub] [airflow] MattTriano opened a new pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

MattTriano opened a new pull request #22236:
URL: https://github.com/apache/airflow/pull/22236


   While working through the tutorial, some parts seemed a bit unclear or incorrect, which took me to issue #18950 ( converted into discussion #22233 about 15 minutes ago). I had already resolved some of the open concerns and saw potiuk encouraging people to improve the documentation, so I took it upon myself.
   
   Main Changes:
   * Wrote tasks to handle db-table-creation in the DAG rather than leave that as an exercise for the reader.
   * Added a step to ensure the existence of the data output directory in the `get_data()` task.
   * Fixed the logic in the `merge_data()` task so that its behavior matches the stated description.
   
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/main/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] MattTriano commented on pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
MattTriano commented on pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#issuecomment-1067379473


   What a `./breeze` (...`build-docs -- --spellcheck-only --package-filter apache-airflow`) that was.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] MattTriano commented on pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
MattTriano commented on pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#issuecomment-1067266113


   ```
    ------------------------------ Error   1 --------------------
    WARNING: Block quote ends without a blank line; unexpected unindent.
   
   File path: apache-airflow/tutorial.rst (548)
   
    545 | And from the last line in the definition of the ``Etl`` DAG, we see:
    546 | 
    547 |   >> get_data() >> merge_data()
   >548 | * the ``merge_data()`` task depends on the ``get_data()`` task,
   ------------------------------ Error   2 --------------------
    ERROR: Unknown target name: "brief delay<https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-dir-list-interval>".
   ```
   
   These errors caused the `Build docs` CI job to fail. Technically it also threw a spelling error, but the only error traceback from the `Output of spelling check: apache-airflow` section pointed to line 548 citing the lack of a blank line or unindentation. I'm confident I handled that correctly in my latest commit.
   
   For Error 2, I think the issue was the lack of whitespace between the text and the bracketed-URI, but I'm not too familiar with the reStructuredText spec, so it's possible that I'm wrong and wasting more compute.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#issuecomment-1067178667


   The PR is likely ready to be merged. No tests are needed as no important environment files, nor python files were modified by it. However, committers might decide that full test matrix is needed and add the 'full tests needed' label. Then you should rebase it to the latest main or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#issuecomment-1067175699


   Thanks @MattTriano  -> let's see how the CI likes it :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] MattTriano commented on pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
MattTriano commented on pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#issuecomment-1067145934


   I spent a bit more time refactoring (after noticing a thought I left incomplete). I assume most tutorial readers are either trying to get things running on their machine or are evaluating whether it's worth taking the time to try, so I tried to make the example terse but still be batteries-included.
   
   Main changes:
   * I explicitly listed setup commands
   * I moved service startup instructions in front of connection creation instructions
   * I refined descriptions of the task development steps,
   * I tried to use phrasing consistent with other documentation pages,
   * I included links to help readers discover a convenient workflow (eg that updates to DAGs in code on the host machine (but still in the `dags/` dir) would be discovered after a delay without needing to restart any services).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#issuecomment-1066260971


   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, mypy and type annotations). Our [pre-commits]( https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst) Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/main/BREEZE.rst) for testing locally, itโ€™s a heavy docker but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
   - Be sure to read the [Airflow Coding style]( https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better ๐Ÿš€.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://s.apache.org/airflow-slack
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy commented on a change in pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on a change in pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#discussion_r825773905



##########
File path: docs/apache-airflow/tutorial.rst
##########
@@ -378,62 +378,83 @@ Lets look at another example; we need to get some data from a file which is host
 Initial setup
 ''''''''''''''''''''
 We need to have Docker and Postgres installed.
-We will be using this `docker file <https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#docker-compose-yaml>`_
+We will be using the `quick-start docker-compose installation <https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html>`_ for the following steps. 
 Follow the instructions properly to set up Airflow.
 
-You can use the postgres_default connection:
+We will also need to create a `connection <https://airflow.apache.org/docs/apache-airflow/stable/concepts/connections.html>`_ to the postgres db. To create one via the web UI, from the "Admin" menu, select "Connections", then click the Plus sign to "Add a new record" to the list of connections.
 
-- Conn id: postgres_default
-- Conn Type: postgres
+Fill in the fields as shown below. Note the Connection Id value, which we'll pass as a parameter for the ``postgres_conn_id`` kwarg.
+
+- Connection Id: tutorial_pg_conn
+- Connection Type: postgres
 - Host: postgres
 - Schema: airflow
 - Login: airflow
 - Password: airflow
+- Port: 5432
 
+Test your connection and if the test is successful, save your connection.
 
-After that, you can test your connection and if you followed all the steps correctly, it should show a success notification. Proceed with saving the connection. For
-
-
-Open up a postgres shell:
-
-.. code-block:: bash
-
-  ./airflow.sh airflow db shell
+Table Creation Tasks
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Create the Employees table with:
+We can use the `PostgresOperator <https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/operators/postgres_operator_howto_guide.html#creating-a-postgres-database-table>`_ to define tasks that create tables in our postgres db.
 
-.. code-block:: sql
+We'll create one table to serve as the destination for the retrieved data and another to facilitate 
 
-  CREATE TABLE EMPLOYEES
-  (
-      "Serial Number" NUMERIC PRIMARY KEY,
-      "Company Name" TEXT,
-      "Employee Markme" TEXT,
-      "Description" TEXT,
-      "Leave" INTEGER
-  );
+.. code-block:: python
 
-Afterwards, create the Employees_temp table:
+  from airflow.providers.postgres.operators.postgres import PostgresOperator
+   
+  create_employees_table = PostgresOperator(
+      task_id="create_employees_table",
+      postgres_conn_id="tutorial_pg_conn",
+      sql="""
+          CREATE TABLE IF NOT EXISTS employees (
+              "Serial Number" NUMERIC PRIMARY KEY,
+              "Company Name" TEXT,
+              "Employee Markme" TEXT,
+              "Description" TEXT,
+              "Leave" INTEGER
+          );""",
+  )
+  
+  create_employees_temp_table = PostgresOperator(
+      task_id="create_employees_temp_table",
+      postgres_conn_id="tutorial_pg_conn",
+      sql="""
+          DROP TABLE IF EXISTS employees_temp;
+          CREATE TABLE employees_temp (
+              "Serial Number" NUMERIC PRIMARY KEY,
+              "Company Name" TEXT,
+              "Employee Markme" TEXT,
+              "Description" TEXT,
+              "Leave" INTEGER
+          );""",
+  )
 
-.. code-block:: sql
+Optional Note: 
+""""""""""""""
+If you want to abstract these sql statements out of your DAG, you can move the statements sql files somewhere within the ``dags/`` directory and pass the sql file_path (relative to ``dags/``) to the ``sql`` kwarg. For ``employees`` for example, create a ``sql`` directory in ``dags/``, put ``employees`` DDL in ``dags/sql/employees_schema.sql``, and modify the PostgresOperator() to 
 
-  CREATE TABLE EMPLOYEES_TEMP
-  (
-      "Serial Number" NUMERIC PRIMARY KEY,
-      "Company Name" TEXT,
-      "Employee Markme" TEXT,
-      "Description" TEXT,
-      "Leave" INTEGER
-  );
+.. code-block:: python
 
-We are now ready write the DAG.
+  create_employees_table = PostgresOperator(
+      task_id="create_employees_table",
+      postgres_conn_id="tutorial_pg_conn",
+      sql="sql/employees_schema.sql"
+  )
 
+and repeat for the ``employees_temp`` table.
 
+Data Retrival Task

Review comment:
       ```suggestion
   Data Retrieval Task
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk merged pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #22236:
URL: https://github.com/apache/airflow/pull/22236


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#issuecomment-1067443180


   Awesome work, congrats on your first merged pull request!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#issuecomment-1067443234


   Thanks! Really cool!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #22236: Addressed some issues in the tutorial mentioned in discussion #22233

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22236:
URL: https://github.com/apache/airflow/pull/22236#issuecomment-1067270562


   > For Error 2, I think the issue was the lack of whitespace between the text and the bracketed-URI, but I'm not too familiar with the reStructuredText spec, so it's possible that I'm wrong and wasting more compute.
   
   You can run it locally (see the error output - it explains what commands you can run to build the docs locally) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org