You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/06/29 08:20:36 UTC

[GitHub] [airflow] turbaszek opened a new pull request #9563: Use update instead of merge in _run_raw_task

turbaszek opened a new pull request #9563:
URL: https://github.com/apache/airflow/pull/9563


   Before:
   ```
   [2020-06-26 15:00:26,228] {asserts.py:61} INFO - INSERT INTO rendered_task_instance_fields (dag_id, task_id, execution_date, rendered_fields) VALUES (%(dag_id)s, %(task_id)s, %(execution_date)s, %(rendered_fields)s)
   
   [2020-06-26 15:00:26,232] {asserts.py:61} INFO - SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date AS task_instance_execution_date, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.pool_slots AS task_instance_pool_slots, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_instance.queued_dttm AS task_instance_queued_dttm, task_instance.pid AS task_instance_pid, task_instance.executor_config AS task_instance_executor_config
   FROM task_instance
   WHERE task_instance.task_id = %(param_1)s AND task_instance.dag_id = %(param_2)s AND task_instance.execution_date = %(param_3)s
   
   [2020-06-26 15:00:26,235] {asserts.py:61} INFO - UPDATE task_instance SET end_date=%(end_date)s, state=%(state)s WHERE task_instance.task_id = %(task_instance_task_id)s AND task_instance.dag_id = %(task_instance_dag_id)s AND task_instance.execution_date = %(task_instance_execution_date)s
   
   [2020-06-26 15:00:26,238] {asserts.py:61} INFO - INSERT INTO log (dttm, dag_id, task_id, event, execution_date, owner, extra) VALUES (%(dttm)s, %(dag_id)s, %(task_id)s, %(event)s, %(execution_date)s, %(owner)s, %(extra)s) RETURNING log.id
   ```
   After:
   ```
   [2020-06-26 15:00:57,858] {asserts.py:61} INFO - INSERT INTO rendered_task_instance_fields (dag_id, task_id, execution_date, rendered_fields) VALUES (%(dag_id)s, %(task_id)s, %(execution_date)s, %(rendered_fields)s)
   
   [2020-06-26 15:00:57,863] {asserts.py:61} INFO - UPDATE task_instance SET end_date=%(end_date)s, state=%(state)s WHERE task_instance.dag_id = %(dag_id_1)s AND task_instance.task_id = %(task_id_1)s AND task_instance.execution_date = %(execution_date_1)s
   
   [2020-06-26 15:00:57,865] {asserts.py:61} INFO - INSERT INTO log (dttm, dag_id, task_id, event, execution_date, owner, extra) VALUES (%(dttm)s, %(dag_id)s, %(task_id)s, %(event)s, %(execution_date)s, %(owner)s, %(extra)s) RETURNING log.id
   ```
   
   ---
   Make sure to mark the boxes below before creating PR: [x]
   
   - [x] Description above provides context of the change
   - [x] Unit tests coverage for changes (not needed for documentation changes)
   - [x] Target Github ISSUE in description if exists
   - [x] Commits follow "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
   - [x] Relevant documentation is updated including usage instructions.
   - [x] I will engage committers as explained in [Contribution Workflow Example](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#contribution-workflow-example).
   
   ---
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   Read the [Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines) for more information.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #9563: Use update instead of merge in _run_raw_task

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #9563:
URL: https://github.com/apache/airflow/pull/9563#issuecomment-651044767


   > I think we should use merge, mainly because then we don't need to care about which fields are updated for a TI.
   
   I agree with this in general. Merge is really useful in many cases but it's not always an optimal solution. In this case, the question is how seriously we want to pay attention to Airflow performance. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek closed pull request #9563: Use update instead of merge in _run_raw_task

Posted by GitBox <gi...@apache.org>.
turbaszek closed pull request #9563:
URL: https://github.com/apache/airflow/pull/9563


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #9563: Use update instead of merge in _run_raw_task

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #9563:
URL: https://github.com/apache/airflow/pull/9563#issuecomment-651045892


   I think it depends - indeed we are trading-of a bit of performance for the explicit query. I am personally in favour of being more explicit (like in python's Zen) of what's going on - especially that this is really what happens here - we expect to only update those values. And actually I find it a bit easier to reason on what's going on here rather than rely on "whatever happens to the object, let's merge it here).  But this is not a super-strong opinion and I am ok with either.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #9563: Use update instead of merge in _run_raw_task

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #9563:
URL: https://github.com/apache/airflow/pull/9563#issuecomment-651054536


   What happens if we did `session.add(self)` instead? We already have an TI object that should be updated, so we could likely just let SQLA handle this all for us.
   
   In terms of performance: I don't think we should aim to reduce the number of queries above all else - our primary goal should be clarify of code/intent, and only for critical/hot paths should we then go for performance above readable code, and as this is so far off the critical path (as it happens on the worker) that this won't have much if any impact on performance of Airflow as a whole. And even for a single task this only adds micro-seconds (one DB round trip)
   
   At the very least if we go with this approach we should have a clear comment saying _why_ we are using this approach.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on pull request #9563: Use update instead of merge in _run_raw_task

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #9563:
URL: https://github.com/apache/airflow/pull/9563#issuecomment-651049545


   > > I think we should use merge, mainly because then we don't need to care about which fields are updated for a TI.
   > 
   > I agree with this in general. Merge is really useful in many cases but it's not always an optimal solution. In this case, the question is how seriously we want to pay attention to Airflow performance.
   
   Yes I have no strong opinion either and by no means, I am an expert on SQL alchemy state handling :) so I will be very happy to learn anything new from the discussion.
   
   My only concern is about other attributes like job_id which is updated a few lines about, do we need to add job_id too in that case?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #9563: Use update instead of merge in _run_raw_task

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #9563:
URL: https://github.com/apache/airflow/pull/9563#issuecomment-651078760


   Closing as this improvement requires more custom logic that it is worth 😉 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #9563: Use update instead of merge in _run_raw_task

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #9563:
URL: https://github.com/apache/airflow/pull/9563#issuecomment-651065623


   > What happens if we did `session.add(self)` instead? We already have an TI object that should be updated, so we could likely just let SQLA handle this all for us.
   
   As Kaxil mentioned: `Original exception was: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "task_instance_pkey"` even if there's no ti in session.
   
   > In terms of performance: I don't think we should aim to reduce the number of queries above all else - our primary goal should be clarify of code/intent, and only for critical/hot paths should we then go for performance above readable code, and as this is so far off the critical path (as it happens on the worker) that this won't have much if any impact on performance of Airflow as a whole. And even for a single task this only adds micro-seconds (one DB round trip)
   
   The readability can be improved by extracting such logic to custom method/function (as we do in many places). I agree that at first look this is not a breakthrough in performance but on the other hand it is an enhancement. 
   
   > At the very least if we go with this approach we should have a clear comment saying _why_ we are using this approach.
   
   I agree.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil edited a comment on pull request #9563: Use update instead of merge in _run_raw_task

Posted by GitBox <gi...@apache.org>.
kaxil edited a comment on pull request #9563:
URL: https://github.com/apache/airflow/pull/9563#issuecomment-651049545


   > > I think we should use merge, mainly because then we don't need to care about which fields are updated for a TI.
   > 
   > I agree with this in general. Merge is really useful in many cases but it's not always an optimal solution. In this case, the question is how seriously we want to pay attention to Airflow performance.
   
   Yes I have no strong opinion either and by no means am I an expert on SQL alchemy state handling :) so I will be very happy to learn anything new from the discussion.
   
   My only concern is about other attributes like job_id which is updated a few lines about, do we need to add job_id too in that case?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #9563: Use update instead of merge in _run_raw_task

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #9563:
URL: https://github.com/apache/airflow/pull/9563#issuecomment-651078235


   I did a test with no TI in db and now I think using merge is better because it will create the TI like this:
   ```
   [2020-06-29 12:11:54,824] {asserts.py:61} INFO - INSERT INTO task_instance (task_id, dag_id, execution_date, start_date, end_date, duration, state, try_number, max_tries, hostname, unixname, job_id, pool, pool_slots, queue, priority_weight, operator, queued_dttm, pid, executor_config) VALUES (%(task_id)s, %(dag_id)s, %(execution_date)s, %(start_date)s, %(end_date)s, %(duration)s, %(state)s, %(try_number)s, %(max_tries)s, %(hostname)s, %(unixname)s, %(job_id)s, %(pool)s, %(pool_slots)s, %(queue)s, %(priority_weight)s, %(operator)s, %(queued_dttm)s, %(pid)s, %(executor_config)s)
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org