You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/01 21:38:33 UTC

[GitHub] [airflow] dimon222 opened a new issue #17897: Dag tags are not refreshing if case sensitivity changed

dimon222 opened a new issue #17897:
URL: https://github.com/apache/airflow/issues/17897


   ### Apache Airflow version
   
   2.1.3 (latest released)
   
   ### Operating System
   
   Debian
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   Custom docker image on k8s
   
   ### What happened
   
   New dag version was written in same named dag. The only difference was one of specified tags got changed in case, for ex. "test" to "Test". With huge amount of dags this causes scheduler immediately to crash due to constraint violation.
   
   ### What you expected to happen
   
   Tags are refreshed correctly without crash of scheduler.
   
   ### How to reproduce
   
   1. Create dag with tags in running airflow cluster
   2. Update dag with change of case of one of tags for ex. "test" to "Test"
   3. Watch scheduler crash continuously
   
   ### Anything else
   
   Alternative option is to make tags case sensitive...
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910724279


   1. Mariadb 10.2.22 (the issue was also noticed on mysql 8 tho, so I'm not certain if it's locking related).
   2. Use row locking is set to True.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910795269


   Sure, I understand I'm somewhat on my own when it comes to MariaDB. I was able to see this issue on Mysql 8 as well, and I have received exactly same stacktrace there. I myself noticed it on Mysql if it wasn't clear from initial message :)
   
   Transaction isolation is decided on session level, so I would assume Airflow/Sqlalchemy internals themselves set some, so it shouldn't matter for me as a user of airflow?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914607025


   https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html?highlight=collation you can find it here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910851491


   Here's what I tried just now (one at a time, of course)
   1. Above mentioned manager.py correction
   2. Restrict amount of parsing processes to 1.
   3. Completely bring scheduler down for time of updating dag tags, then bring it back up
   4. Increase min_file_process_interval
   5. Change "use_row_level_locking" to False
   
   In all the possible options I'm receiving the above mentioned SQL error. If I decrease amount of processes to 1 the scheduler immediately chonks, spits this error and exits with non-zero exit code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914582333


   > Just one more question - can you see the same if you just "change" the tag (i.e. no case-sensitive-only change) "test" -> "test1" for example? Since we cannot reproduce it, I am looking for a way to find out if it might be a DB configuration of some sorts.
   Doesn't happen. Completely different tag is fine, but same tag with different case is not. I was hoping its some kind case insensitive primary key, but I see no indication of existence of such term at all.
   
   > 
   
   
   
   > Very, very interesting. Curious to see it with the fix. Is it also possible that you add two wanings to those two sides of the condition and see the Warnings generated? (there will be a lot of them but I wonder whether the "if" statement works as expected in your case)
   > 
   > https://github.com/apache/airflow/blob/430976caad5970b718e3dbf5899d4fc879c0ac89/airflow/utils/sqlalchemy.py#L247
   
   Both clauses in `if` are reporting match as True.
   The issue is happening even in Mysql 8, I tested even with proposed above manager.py correction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910851491


   Here's what I tried just now (one at a time, of course)
   1. Above mentioned manager.py correction
   2. Restrict amount of parsing processes to 1.
   3. Completely bring scheduler down for time of updating dag tags, then bring it back up
   4. Increase min_file_process_interval
   5. Change "use_row_level_locking" to False (I have it as True initially)
   
   In all the possible options I'm receiving the above mentioned SQL error. If I decrease amount of processes to 1 the scheduler immediately chonks, spits this error and exits with non-zero exit code.
   
   I have also enabled debug logging and tried to see what DAGs are being picked, and I don't seem to notice anything that might reflect concurrency problems (or picking up same dag twice). Unless, the problem is with logic that does bulk insert ? Like doubling amount of queries or such.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914671906


   @potiuk 
   Well, when I briefly read collation docs I was kind of similar feeling that they went nuts on this.
   I will avoid going much offtop, but in my opinion there's a reason MySQL and MariaDB still pop:
   1. Legacy and modern world living together (thats what I do for life, not sure yet if its for good or for bad). People who yesterday just jumped off Oracle, transitioned to MySQL to find out that there's Postgres waiting for them.
   2. MariaDB/MySQL are also still there because of on-premise enterprise support barely offered for Postgres solutions. Yeah, we all need not just automatic backups, but scalable clusters that can recover immediately without 50 stars github application doing some behind the scene meddling. All that have to work on-premise and be reasonably priced. (lmao)
   
   Incase this topic "mysql/mariadb, why and should we" does need more discussion (hey, I still remember how hard was Gitlab at making this decision!), I would suggest transition to Slack thread or perhaps separate dedicated Issue discussion.
   
   PS: for tests I was meaning just literally create dummy dag + tag, and update tag to tag in another case. See if its able to replicate this use case with built-in methods.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914644779


   Since table for me was created long back in Airflow 1 times, the option for collation in airflow.cfg wasn't much useful, as something have to convert all the existing indexes.
   
   I have done more tests and then evaluated the long ago created DDL of `dag_tag` table.  The charset for it was set latin1, so I had to use corresponding case aware collation `latin1_bin`. As it has foreign key reference, the same collation is needed on `dag` table. Afterall, I was able to do the conversion and finally, success, scheduler no longer crashes and does refresh tags correctly.
   
   Overall the procedure for anyone else who might encounter it:    
   1. Check current charset of `dag_tag`, find matching _bin collation with case awareness
   2. Drop foreign key constraint on `dag_tag`
   3. Convert charset with new collate on `dag`
   4. Convert charset with new collate on `dag_tag`
   5. Add foreign key constraint back to `dag_tag`
   
   Let me know if there's less painful way. I haven't yet identified if it's going to cause other issues like joins between tables of dag and anything else (I hope it won't....), otherwise it would mean collation have to be converted everywhere with dag_id
   
   PS: solution works for both mariadb and mysql


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910671083


   The issue was initially noticed on 2.1.2 version, but I have given it a test on version 2.1.3 now and I'm having it.
   
   Perhaps, this might point in this direction:
   1. Store dags in DB option is enabled in airflow.cfg
   2. To make scheduler crash it should be a lot of dags with this problem so that all processes that are doing parsing kicked in simultaneously and raised exception. I tried it with just single dag and it didn't crash the scheduler (tho, spammed a lot of stacktraces in scheduler and the dag is essentially broken).
   
   The exact stacktrace is very long but here's the most important part of it:
   ```
   ...
   File "/venv/lib/python3.6/site-packages/airflow/models/dag.py", line 1920, in bulk_write_to_db
       DagCode.bulk_sync_to_db([dag.fileloc for dag in orm_dags])
   ...
   File "/venv/lib/python3.6/site-packages/MySQLdb/connections.py", line 259, in query
      _mysql.connection.query(self, query)
   sqlalchemy.exc.IntegrityError: (MySQLdb._exceptions.IntegrityError) (1062, "Duplicate entry "Test-testdag" for key 'PRIMARY'")
   [SQL: INSERT INTO dag_tag (name, dag_id) VALUES (%s, %s)]
   [parameters: ('Test', 'testdag')]
   (Background on this error at: http://sqlalche.me/e/13/gpkj )
   ```
   
   The dag initially had tags `['test']` and then the tag listing was changed to `['Test']`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910833359


   Some update @ashb  after more looking. I realized that IF I am right about the hypothesis, we'd heave much bigger problem for multiple schedulers. so I looked a bit deeper. While my fix is likely to help with 1 scheduler, it would not help for more of them.
   
   But I found that this should actually prevent updating TAGs for the same dag_id by two parallell parsing processes (regardless if in one or more schedulers), So I think the only explanation I see now is that row-level locking is entirely disabled (@dimon222 - do you actually have it disabled (scheduler/use_row_level_locking)?):
   ```
           query = (
               session.query(DagModel)
               .options(joinedload(DagModel.tags, innerjoin=False))
               .filter(DagModel.dag_id.in_(dag_ids))
           )
           orm_dags = with_row_locks(query, of=DagModel, session=session).all()
   ``


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914589729


   Upon further research I have found term `collate` that seems to be set to case insensitive `utf8_general_ci`  by default for connection, database and server. I'm currently checking why and what defines it. Perhaps, good addition might be to explicitly specify collate in Airflow schema creation scripts if it's not there yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910705755


   1. Single scheduler only
   2. 2 parsing processes
   3. I was able to replicate issue on cluster that had just 3 dags.
   4. sort_mode is modified_time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910851491


   Here's what I tried just now (one at a time, of course)
   1. Above mentioned manager.py correction
   2. Restrict amount of parsing processes to 1.
   3. Completely bring scheduler down for time of updating dag tags, then bring it back up
   4. Increase min_file_process_interval
   5. Change "use_row_level_locking" to False (I have it as True initially)
   
   In all the possible options I'm receiving the above mentioned SQL error. If I decrease amount of processes to 1 the scheduler immediately chonks, spits this error and exits with non-zero exit code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910671083


   The issue was initially noticed on 2.1.2 version, but I have given it a test on version 2.1.3 now and I'm having it.
   
   Perhaps, this might point in this direction:
   1. Store dags in DB option is enabled in airflow.cfg
   2. To make scheduler crash it should be a lot of dags with this problem so that all processes that are doing parsing kicked in simultaneously and raised exception. I tried it with just single dag and it didn't crash the scheduler (tho, spammed a lot of stacktraces in scheduler and the dag is essentially broken).
   
   The exact stacktrace is very long but here's the most important part of it:
   ```
   ...
   File "/venv/lib/python3.6/site-packages/airflow/models/dag.py", line 1920, in bulk_write_to_db
       DagCode.bulk_sync_to_db([dag.fileloc for dag in orm_dags])
   ...
   File "/venv/lib/python3.6/site-packages/MySQLdb/connections.py", line 259, in query
      _mysql.connection.query(self, query)
   sqlalchemy.exc.IntegrityError: (MySQLdb._exceptions.IntegrityError) (1062, "Duplicate entry "Test-testdag" for key 'PRIMARY'")
   [SQL: INSERT INTO dag_tag (name, dag_id) VALUES (%s, %s)]
   [parameters: ('Test', 'testdag')]
   (Background on this error at: http://sqlalche.me/e/13/gpkj )
   ```
   ```
   The dag initially had tags `['test']` and then the tag listing was changed to `['Test']`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910755044


   Ah. That's interesting news, Mariadb is not officially supported - you know that BTW ? Technically it has full right to not work, so I I am not sure if we will be able to help here because we simply do not test Airflow on Maria DB. What do you mean it's been noticed ? Who noticed it ? do you have some logs and observations from MySQL as well?
   
   I think - BTW,  it has nothing to do with row locking because it happens during parsing rather than row locking, and locking is only used during scheduling, once the DAGs are already in the DB and that does not touch tags.
   
   I was asking more for transaction isolation level than row locking: 
   https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
   
   or MariaDB: https://mariadb.com/kb/en/mariadb-transactions-and-isolation-levels-for-sql-server-users/
   
   Can you double check that please?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
SamWheating edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-909728670


   I am unable to reproduce this issue on the current main branch of Airflow as well as on version 2.1.3.
   
   Could you share the scheduler logs surrounding the scheduler crash? Seeing the actual constraint violation which occurred might help to understand whats going on here. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914644779


   Since table for me was created long back in Airflow 1 times, the option for collation in airflow.cfg wasn't much useful, as something have to convert all the existing indexes.
   
   I have done more tests and then evaluated the long ago created DDL of `dag_tag` table.  The charset for it was set latin1, so I had to use corresponding case aware collation `latin1_bin`. As it has foreign key reference, the same collation is needed on `dag` table. Afterall, I was able to do the conversion and finally, success, scheduler no longer crashes and does refresh tags correctly.
   
   Overall the procedure for anyone else who might encounter it:    
   1. Check current charset, find matching _bin collation with case awareness
   2. Drop foreign key constraint on `dag_tag`
   3. Convert charset with new collate on `dag`
   4. Convert charset with new collate on `dag_tag`
   5. Add foreign key constraint back to `dag_tag`
   
   Let me know if there's less painful way. I haven't yet identified if it's going to cause other issues like joins between tables of dag and anything else (I hope it won't....), otherwise it would mean collation have to be converted everywhere with dag_id
   
   PS: solution works for both mariadb and mysql


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914621919


   Thanks! For being persistent. Glad that (I think) we know the problem now. I am adding a PR to fix our docs with it! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914644779


   Since table for me was created back then, the option for collation in airflow.cfg wasn't much useful, as something have to convert all the indexes.
   
   I have done more tests and then evaluated the long ago created DDL of `dag_tag` table.  The charset for it was set latin1, so I had to use corresponding case aware collation `latin1_bin`. As it has foreign key reference, the same collation is needed on `dag` table. Afterall, I was able to do the conversion and finally, success, scheduler no longer crashes and does refresh tags correctly.
   
   Overall the procedure for anyone else who might encounter it:    
   1. Check current charset, find matching _bin collation with case awareness
   2. Drop foreign key constraint on `dag_tag`
   3. Convert charset with new collate on `dag`
   4. Convert charset with new collate on `dag_tag`
   5. Add foreign key constraint back to `dag_tag`
   
   Let me know if there's less painful way. I haven't yet identified if it's going to cause other issues like joins between tables of dag and anything else (I hope it won't....), otherwise it would mean collation have to be converted everywhere with dag_id
   
   PS: solution works for both mariadb and mysql


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910851491


   Here's what I tried just now (one at a time, of course)
   1. Above mentioned manager.py correction
   2. Restrict amount of parsing processes to 1.
   3. Completely bring scheduler down for time of updating dag tags, then bring it back up
   4. Increase min_file_process_interval
   5. Change "use_row_level_locking" to False
   
   In all the possible options I'm receiving the above mentioned SQL error. If I decrease amount of processes to 1 the scheduler immediately chonks and goes exits with non-zero exit code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914610246


   > You likely have this set as `sql_engine_collation_for_ids=utf8mb3_general_ci` parameter in airflow.cfg
   
   Worse, it's commented away in my config :D
   Perhaps was missed during airflow 2 upgrade.
   
   I will test this setting in both mysql and mariadb shortly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910851491


   Here's what I tried just now (one at a time, of course)
   1. Above mentioned manager.py correction
   2. Restrict amount of parsing processes to 1.
   3. Completely bring scheduler down for time of updating dag tags, then bring it back up
   4. Increase min_file_process_interval
   5. Change "use_row_level_locking" to False (I have it as True initially)
   6. Change sort mode to alphabetical
   7. Tune other timeouts
   
   In all the possible options I'm receiving the above mentioned SQL error. If I decrease amount of processes to 1 the scheduler immediately chonks, spits this error and exits with non-zero exit code.
   
   I have also enabled debug logging and tried to see what DAGs are being picked, and I don't seem to notice anything that might reflect concurrency problems (or picking up same dag twice). Unless, the problem is with logic that does bulk insert ? Like doubling amount of queries or such. I tried for disable Sqlalchemy pool, print all the tags it detects - everything seems fine from collected logs, but issue still arises. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914609904


   All right . It seems that MySQL is worse than I thought as they have no good utf8 case sensitive collation (!) https://stackoverflow.com/questions/4558707/case-sensitive-collation-in-mysql - the best you can (likely) do is `utf8_bin` (it messes order-by but in most cases it will be ok when no accented characters are used as ids they will be alphabetically sorted.
   
   God. I HATE Mysql for all that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #17897:
URL: https://github.com/apache/airflow/issues/17897


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914621919


   Thanks! Also for being persistent. Glad that (I think) we know the problem now. I am adding a PR to fix our docs with it! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910789621


   @ashb  that might be an interesting one - and need your second pair of eyes here (and your expert knowldge of that part).
   
   I looked at the code and the only explanation I have (regardless of the isolation level question) is that when you have two parsers and when you change a DAG file 'quickly" you might technically end up with two parsers loading same file in parallel (and subsequently failing on updating TAGs. Tag Updating is done in several steps with DELETES and INSERTS and it looks like pretty interesting (and likely) race condition. 
   
   My hypothesis:
   
   1) first pass of the DagFileProcessorManager loop finds a DAG file to parse - adds it to the _file_path_queue (but other files are being parsed so it is not picked up yet)
   
   2) the DAG file gets changed (tag is renamed to a new name)
   
   3) next pass of parser loop finds again the same file (sorted by modification time reverse so it is found first) . in case of  modified_time, the "min_file_process_interval" is not honored so it will be picked up immediately and added to the list.
   The check in DagFileProcessorManager does not include self.__file_path_queue - so I believe if the same file is already in the queue, it will be added for the second time (this queue is a list):
   ```
           file_paths_to_exclude = set(file_paths_in_progress).union(
               file_paths_recently_processed, files_paths_at_run_limit
           )
   ```
   
   4) Then the same file appears twice in the self._file_path_queue and if they are next to each other in the queue and so it happens that they are picked up by two parallel parser processes. By the time those parsers pick them up they pick the same version of the files and they try to insert the same TAG for the same dag -> hence the duplicate entry. 
   
   Does it sound plausible? Did I miss something? I think the solution for that will be:
   
   ```
           file_paths_to_exclude = set(file_paths_in_progress).union(
               file_paths_recently_processed, files_paths_at_run_limit
           ).union(set(self._file_path_queue))
   ```
   
   Does it look legit ? 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
SamWheating edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-909728670


   I am unable to reproduce this issue on the current main branch of Airflow as well as on version 2.1.3.
   
   Could you share the scheduler logs surrounding the scheduler crash? Seeing the actual constraint violation which occurred might help to understand whats going on here. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-911947035


   Just tested on Mysql 8.0.11 and was able to replicate exactly same scenario.
   
   As reminder,
   1. Store DAGs to True
   2. Create dag with tags, let airflow pick it up
   3. Edit 1 tag in that dag by changing only case (test -> Test)
   4. Scheduler fails to populate updated tags and giant stacktrace of duplicate row.
   
   Testing the provided change in manager.py now...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910789621


   @ashb  that might be an interesting one - and need your second pair of eyes here (and your expert knowldge of that part).
   
   I looked at the code and the only explanation I have (regardless of the isolation level question) is that when you have two parsers and when you change a DAG file 'quickly" you might technically end up with two parsers loading two different versions of the same file (and subsequently failing on updating TAGs. Tag Updating is done in several steps with DELETES and INSERTS and it looks like pretty interesting (and likely) race condition. 
   
   My hypothesis:
   
   1) first pass of lDagFileProcessorManager oop finds a DAG file to parse - adds it to the _file_path_queue (but other files are being parsed so it is not picked up yet)
   
   2) the DAG file gets changed (tag is renamed to a new name)
   
   3) next pass of parser loop finds again the same file (sorted by modification time reverse so it is found first) . in case of  modified_time, the "min_file_process_interval" is not honored so it will be picked up immediately and added to the list.
   The check in DagFileProcessorManager does not include self.__file_path_queue - so I believe if the same file is already in the queue, it will be added for the second time (this queue is a list):
   ```
           file_paths_to_exclude = set(file_paths_in_progress).union(
               file_paths_recently_processed, files_paths_at_run_limit
           )
   ```
   
   4) Then the same file appears twice in the self._file_path_queue and if they are next to each other in the queue and so it happens that they are picked up by two parallel parser processes. By the time those parsers pick them up they pick the same version of the files and they try to insert the same TAG for the same dag -> hence the duplicate entry. 
   
   Does it sound plausible? Did I miss something? I think the solution for that will be:
   
   ```
           file_paths_to_exclude = set(file_paths_in_progress).union(
               file_paths_recently_processed, files_paths_at_run_limit
           ).union(set(self._file_path_queue))
   ```
   
   Does it look legit ? 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910697939


   This looks like a problem when some parses are still parsing the old versions and then some other parsers kicking in to parse the new ones, working in parallel. I do not think it's the matter of changing case. I think it's the matter of changing tag for the same dag, How many schedulers you have and what's the number of parsing processes you have  ? 
   https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#parsing-processes
   Also what's your parsing sort_mode (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#file-parsing-sort-mode) ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914582333


   > Just one more question - can you see the same if you just "change" the tag (i.e. no case-sensitive-only change) "test" -> "test1" for example? Since we cannot reproduce it, I am looking for a way to find out if it might be a DB configuration of some sorts.
   
   Doesn't happen. Completely different tag is fine, but same tag with different case is not. I was hoping its some kind case insensitive primary key, but I see no indication of existence of such term at all.
   
   
   > Very, very interesting. Curious to see it with the fix. Is it also possible that you add two wanings to those two sides of the condition and see the Warnings generated? (there will be a lot of them but I wonder whether the "if" statement works as expected in your case)
   > 
   > https://github.com/apache/airflow/blob/430976caad5970b718e3dbf5899d4fc879c0ac89/airflow/utils/sqlalchemy.py#L247
   
   Both clauses in `if` are reporting match as True.
   The issue is happening even in Mysql 8, I tested even with proposed above manager.py correction today and still no success.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910833359


   Some update @ashb  after more looking. I realized that IF I am right about the hypothesis, we'd heave much bigger problem for multiple schedulers. so I looked a bit deeper. While my fix is likely to help with 1 scheduler, it would not help for more of them.
   
   But I found that this should actually prevent updating TAGs for the same dag_id by two parallell parsing processes (regardless if in one or more schedulers), So I think the only explanation I see now is that row-level locking is entirely disabled (@dimon222 - do you actually have it disabled ([scheduler]  -> use_row_level_locking)?):
   ```
           query = (
               session.query(DagModel)
               .options(joinedload(DagModel.tags, innerjoin=False))
               .filter(DagModel.dag_id.in_(dag_ids))
           )
           orm_dags = with_row_locks(query, of=DagModel, session=session).all()
   ``


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914653325


   @potiuk I would highly recommend to also add test case for such scenario... Just incase.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914606789


   Ah. That's actually great finding. I had no idea ci stands for case-insensitive !. 
   
   
   We do have utf8mb3_general_ci mentioned in the docs as workaround to set for `utf8mb4` encoding which breaks MySQL installation due to too large index size.
   
   I will update the docs accordingly, especially tha this encoding is set by default now in the upcoming Airflow 2
   
   You likely have this set  as `sql_engine_collation_for_ids=utf8mb3_general_ci` parameter in airflow.cfg


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910755044


   Ah. That's interesting news, Mariadb is not officially supported - you know that BTW ? Technically it has full right to not work, so I I am not sure if we will be able to help here because we simply do not test Airflow on Maria DB. What do you mean it's been noticed ? Who noticed it ? do you have some logs and observations from MySQL as well?
   
   I think - BTW,  it has nothing to do with row locking because it happens during parsing rather than scheduling, and locking is only used during scheduling, once the DAGs are already in the DB and that does not touch tags.
   
   I was asking more for transaction isolation level than row locking: 
   https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
   
   or MariaDB: https://mariadb.com/kb/en/mariadb-transactions-and-isolation-levels-for-sql-server-users/
   
   Can you double check that please?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914623151


   Added #18072 - I will let it run all tests (part of our CI is to create MySQL DB with the collation) and see if it works.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #17897:
URL: https://github.com/apache/airflow/issues/17897


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910796933


   In the meantime. @dimon222  - is it possible that you try my proposed fix, to see if it actually fixes it ? You'd need to modify the airflow/dag_processing/manager.py line ~994 (not sure in your version) 
   
   ```
           file_paths_to_exclude = set(file_paths_in_progress).union(
               file_paths_recently_processed, files_paths_at_run_limit
           )
   ```
   
   -> 
   
   ```
           file_paths_to_exclude = set(file_paths_in_progress).union(
               file_paths_recently_processed, files_paths_at_run_limit
           ).union(set(self._file_path_queue))
   ```
   
   And yeah. I think - after deeper looking - that transaction isolation would not explain it (it might explain phantom reads but not really phantom writes that you experience.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914610246


   > You likely have this set as `sql_engine_collation_for_ids=utf8mb3_general_ci` parameter in airflow.cfg
   
   Worse, it's commented away in my config :D
   Perhaps was missed during airflow 2 upgrade.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914644779


   Since table for me was created back then, the option for collation in airflow.cfg wasn't much useful, as something have to convert all the indexes.
   
   I have done more tests and then evaluated the long ago created DDL of `dag_tag` table.  The charset for it was set latin1, so I had to use corresponding case aware collation `latin1_bin`. As it has foreign key reference, the same collation is needed on `dag` table. Afterall, I was able to do the conversion and finally, success, scheduler no longer crashes and does refresh tags correctly.
   
   Overall the procedure for anyone else who might encounter it:    
   1. Check current charset, find matching _bin collation with case awareness
   2. Drop foreign key constraint on `dag_tag`
   3. Convert charset with new collate on `dag`
   4. Convert charset with new collate on `dag_tag`
   5. Add foreign key constraint back to `dag_tag`
   
   Let me know if there's less painful way. I haven't yet identified if it's going to cause other issues like joins between tables of dag and anything else (I hope it won't....), otherwise it would mean collation have to be converted everywhere with dag_id


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914589729


   Upon further research I have term `collate` that seems to be set to case insensitive `utf8_general_ci`  by default for connection, database and server. I'm currently checking why and what defines it. Perhaps, good addition might be to explicitly specify collate in Airflow schema creation scripts if it's not there yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-909728670


   I was unable to reproduce this issue on the current main branch of Airflow, could you share the scheduler logs surrounding the scheduler crash?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914671906


   @potiuk 
   Well, when I briefly read collation docs I was kind of similar feeling that they went nuts on this.
   I will avoid going much offtop, but in my opinion there's a reason MySQL and MariaDB still pop:
   1. Legacy and modern world living together (thats what I do for life, not sure yet if its for good or for bad). People who yesterday just jumped off Oracle, transitioned to MySQL to find out that there's Postgres waiting for them.
   2. MariaDB/MySQL are also still there because of on-premise enterprise support barely offered for Postgres solutions. Yeah, we all need not just automatic backups, but scalable clusters that can recover immediately without 50 stars github application doing some behind the scene meddling. All that have to work on-premise and be reasonably priced. (lmao)
   
   Incase this topic "mysql/mariadb, why and should we" does need more discussion (hey, I still remember how hard was Gitlab at making this decision!), I would suggest transition to Slack thread or perhaps separate dedicated Issue discussion.
   
   PS: for tests I was meaning just literally create dummy dag, and update tag to tag in another case. See if its able to replicate this use case with built-in methods.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-911378783


   CAn you try it again on MySQL 8  ? You did mention that you saw it before, but somehow I have a feeling that this is REALLY a MariaDB problem. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-911947035


   Just tested on Mysql 8.0.11 and was able to replicate exactly same scenario.
   
   As reminder,
   1. Store DAGs to True
   2. Create dag with tags, let airflow pick it up
   3. Edit 1 tag in that dag by changing only case (test -> Test)
   4. Scheduler fails to populate updated tags and giant stacktrace of duplicate row.
   
   I will try testing the provided change in manager.py now now with Mysql...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910789621


   @ashb  that might be an interesting one - and need your second pair of eyes here (and your expert knowldge of that part).
   
   I looked at the code and the only explanation I have (regardless of the isolation level question) is that when you have two parsers and when you change a DAG file 'quickly" you might technically end up with two parsers loading same file in parallel (and subsequently failing on updating TAGs. Tag Updating is done in several steps with DELETES and INSERTS and it looks like pretty interesting (and likely) race condition. 
   
   My hypothesis:
   
   1) first pass of lDagFileProcessorManager oop finds a DAG file to parse - adds it to the _file_path_queue (but other files are being parsed so it is not picked up yet)
   
   2) the DAG file gets changed (tag is renamed to a new name)
   
   3) next pass of parser loop finds again the same file (sorted by modification time reverse so it is found first) . in case of  modified_time, the "min_file_process_interval" is not honored so it will be picked up immediately and added to the list.
   The check in DagFileProcessorManager does not include self.__file_path_queue - so I believe if the same file is already in the queue, it will be added for the second time (this queue is a list):
   ```
           file_paths_to_exclude = set(file_paths_in_progress).union(
               file_paths_recently_processed, files_paths_at_run_limit
           )
   ```
   
   4) Then the same file appears twice in the self._file_path_queue and if they are next to each other in the queue and so it happens that they are picked up by two parallel parser processes. By the time those parsers pick them up they pick the same version of the files and they try to insert the same TAG for the same dag -> hence the duplicate entry. 
   
   Does it sound plausible? Did I miss something? I think the solution for that will be:
   
   ```
           file_paths_to_exclude = set(file_paths_in_progress).union(
               file_paths_recently_processed, files_paths_at_run_limit
           ).union(set(self._file_path_queue))
   ```
   
   Does it look legit ? 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914609904


   All right . It seems that MySQL is worse than I thought as they have no good utf8 case sensitive collation (!) https://stackoverflow.com/questions/4558707/case-sensitive-collation-in-mysql - the best you can (likely) do is `utf8mb3_bin` (it messes order-by but in most cases it will be ok when no accented characters are used as ids they will be alphabetically sorted.
   
   God. I HATE Mysql for all that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910671083


   The issue was initially noticed on 2.1.2 version, but I have given it a test on version 2.1.3 now and I'm having it.
   
   Perhaps, this might point in this direction:
   1. Store dags in DB option is enabled in airflow.cfg
   2. To make scheduler crash it should be a lot of dags with this problem so that all processes that are doing parsing kicked in simultaneously and raised exception. I tried it with just single dag and it didn't crash the scheduler (tho, spammed a lot of stacktraces in scheduler and the dag is essentially broken).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914582333


   > Just one more question - can you see the same if you just "change" the tag (i.e. no case-sensitive-only change) "test" -> "test1" for example? Since we cannot reproduce it, I am looking for a way to find out if it might be a DB configuration of some sorts.
   
   Doesn't happen. Completely different tag is fine, but same tag with different case is not. I was hoping its some kind case insensitive primary key, but I see no indication of existence of such term at all.
   
   
   > Very, very interesting. Curious to see it with the fix. Is it also possible that you add two wanings to those two sides of the condition and see the Warnings generated? (there will be a lot of them but I wonder whether the "if" statement works as expected in your case)
   > 
   > https://github.com/apache/airflow/blob/430976caad5970b718e3dbf5899d4fc879c0ac89/airflow/utils/sqlalchemy.py#L247
   
   Both clauses in `if` are reporting match as True.
   The issue is happening even in Mysql 8, I tested even with proposed above manager.py correction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-913030831


   Just one more question - can you see the same if you just "change" the tag (i.e. no case-sensitive-only change) "test" -> "test1" for example? Since we cannot reproduce it, I am looking for a way to find out if it might be a DB configuration of some sorts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910671083


   The issue was initially noticed on 2.1.2 version, but I have given it a test on version 2.1.3 now and I'm having it there as well.
   
   Perhaps, this might point in this direction:
   1. Store dags in DB option is enabled in airflow.cfg
   2. To make scheduler crash it should be a lot of dags with this problem so that all processes that are doing parsing kicked in simultaneously and raised exception. I tried it with just single dag and it didn't crash the scheduler (tho, spammed a lot of stacktraces in scheduler and the dag is essentially broken).
   
   The exact stacktrace is very long but here's the most important part of it:
   ```
   ...
   File "/venv/lib/python3.6/site-packages/airflow/models/dag.py", line 1920, in bulk_write_to_db
       DagCode.bulk_sync_to_db([dag.fileloc for dag in orm_dags])
   ...
   File "/venv/lib/python3.6/site-packages/MySQLdb/connections.py", line 259, in query
      _mysql.connection.query(self, query)
   sqlalchemy.exc.IntegrityError: (MySQLdb._exceptions.IntegrityError) (1062, "Duplicate entry "Test-testdag" for key 'PRIMARY'")
   [SQL: INSERT INTO dag_tag (name, dag_id) VALUES (%s, %s)]
   [parameters: ('Test', 'testdag')]
   (Background on this error at: http://sqlalche.me/e/13/gpkj )
   ```
   
   The dag initially had tags `['test']` and then the tag listing was changed to `['Test']`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910795269


   Sure, I understand I'm somewhat on my own when it comes to MariaDB. I was able to see this issue on Mysql 8 as well, and I have received exactly same stacktrace there. I myself noticed it on Mysql if it wasn't clear from initial message :)
   
   Transaction isolation is decided on session level, so I would assume Airflow/Sqlalchemy internals themselves set some, so it shouldn't matter for me as a user of airflow as I don't have control over it?
   Default for mysql 8 is `repeatable read` based on docs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-911951115


   Very, very interesting. Curious to see it with the fix. Is it also possible that you add two wanings to those two sides of the condition and see the Warnings generated?  (there will be a lot of them but I wonder whether the "if" statement works as expected in your case)
   
   https://github.com/apache/airflow/blob/430976caad5970b718e3dbf5899d4fc879c0ac89/airflow/utils/sqlalchemy.py#L247
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910705755


   1. Single scheduler only
   2. 2 parsing processes
   3. I was able to replicate issue on cluster that had just 3 dags. (so might be not extremely difficult)
   4. sort_mode is modified_time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910717496


   What is your database and what is your transaction isolation level? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-909728670


   I was unable to reproduce this issue on the current main branch of Airflow, could you share the scheduler logs surrounding the scheduler crash?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910697939


   This looks like a problem when some parses are still parsing the old versions and then some other parsers kicking in to parse the new ones, working in parallel. I do not think it's the matter of changing case. I think it's the matter of changing tag for the same task which is parsed in parallell with slightly different content by two parsers, How many schedulers you have and what's the number of parsing processes you have  ? 
   https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#parsing-processes
   Also what's your parsing sort_mode (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#file-parsing-sort-mode) ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910795269


   Sure, I understand I'm somewhat on my own when it comes to MariaDB. I was able to see this issue on Mysql 8 as well, and I have received exactly same stacktrace there. I myself noticed it on Mysql if it wasn't clear from initial message :)
   
   Transaction isolation is decided on session level, so I would assume Airflow/Sqlalchemy internals themselves set some, so it shouldn't matter for me as a user of airflow as I don't have control over it?
   Default for mysql 8 is `repeatable read` based on Mysql docs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 edited a comment on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 edited a comment on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910851491


   Here's what I tried just now (one at a time, of course)
   1. Above mentioned manager.py correction
   2. Restrict amount of parsing processes to 1.
   3. Completely bring scheduler down for time of updating dag tags, then bring it back up
   4. Increase min_file_process_interval
   5. Change "use_row_level_locking" to False
   
   In all the possible options I'm receiving the above mentioned SQL error. If I decrease amount of processes to 1 the scheduler immediately chonks and exits with non-zero exit code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-910724279


   1. Mariadb (the issue was also noticed on mysql 8 tho, so I'm not certain if it's locking related).
   2. Use row locking is set to True.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimon222 commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
dimon222 commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914671906


   @potiuk 
   Well, when I briefly read collation docs I was kind of similar feeling that they went nuts on this.
   I will avoid going much offtop, but in my opinion there's a reason MySQL and MariaDB still pop:
   1. Legacy and modern world living together (thats what I do for life, not sure yet if its for good or for bad). People who yesterday just jumped off Oracle, transitioned to MySQL to find out that there's Postgres waiting for them.
   2. MariaDB/MySQL are also still there because of on-premise enterprise support barely offered for Postgres solutions. Yeah, we all need not just automatic backups, but scalable clusters that can recover immediately without 50 stars github application doing some behind the scene meddling. All that have to work on-premise and be reasonably priced. (lmao)
   
   Incase this topic "mysql/mariadb, why and should we" does need more discussion (hey, I still remember how hard was Gitlab at making this decision!), I would suggest transition to Slack thread or perhaps separate dedicated Issue discussion.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914664135


   > PS: solution works for both mariadb and mysql
   
   Great you checked - I also already saw that Our CI tests pass with utf8mb3_bin. Great that you found it before we made it a default for mysql! 
   
   > @potiuk I would highly recommend to also add test case for such scenario... Just incase.
   
   I do not even know what KIND of test we want to start with there? What logic would you recommend?
   
   ** BEGINING OF RANT **
   
   MySQL and MariaDB are very bad about the collation/encoding and it's really painful to workaround this - they have a long history of problems with those. First of all latin1 default encoding (with latin1_swedish_ci as default collation - note the case insensivity here as well) up until MySQL8 (Hey MySQL is from Sweden ;) ). 
   
   Then they changed it to `utf8` in Mysql8. But hey.. When you use `utf8` in MySQL8 it's actually utf8mb3 which is (listen to that!) deprecated in the very same MySQL8 and  it will be replaced at some point of time 
   
   The message below is ridiculous, yet it is official documentation of MySQL: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html
   
   > The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
    
   Now the problem is that they have also index size problem. Size of index is very limited and you blow up past the max index size once you have utf8mb4 (hey - it takes 4 times the number of characters you can fit). This is the reason why we had to specifically set utf8mb3_bin as collation for ids, because otherwise you would not be able to create Airflow DB. And yet the utf8mb3 is deprecated, but we have to use it.
   
   For worse - the encoding can be specified in 4 places. In 3 places in the database and separately in the client. And the one on the client might be completely diferent than the one on the server (it is by default derived from the LANG settings you have on the client and NOT from what your server's encoding/collation is - so, if by chance you have wrong settings of LANG on your client and you will not force it properly in the URL of the connection, then you are out of luck and if you happen to use an accented character (and only then) you will get random unicode/decode errors because the client will try to decode the characters encoded in different charset. 
   
   Also here you might find a very nice tutorial - where you will learn how to deal with collation on MySQL https://www.mysqltutorial.org/mysql-collation/ - when you go the end of it, you will see:
   
   > Now, the c1 column has the latin1  character set, but what about its collation? Is it inheriting the latin1_german1_ci collation from the table’s collation?
   > The answer is no. The reason is that the default collation of the latin1 character set is latin1_swedish_ci, therefore, the c1 column will have the latin1_swedish_ci collation.
   
   In an official nice tutorial they mention that "even if you think that the encoding is this, it is completely counter-intuitive and it's something different".... C'mon.
   
   ** END OF RANT **
   
   No. Honestly @dimon222  - I do not even know where to start and how to write any kind of tests that would deal with that madness. When I started my work with Airflow, I was warned that MySQL is a poor database. Well. 
   
   YES. IT IS TERRIBLE DATABASE.. DON'T USE IT. USE POSTGRES
   
   Switch as soon as you can. Google Composer team switched to Postgres in Airflow 2 from MySQL in 1.10 because they realised that this is the only way they can deal with MySQL.
   
   For now we have tests that create and run the MySQL database with 'utf8mb4' encoding and `utf8mb3_bin` collation for ids. That's it. This is the only supported setting. No more.
   
   (Sorry I had to vent my frustration . I hope you will find it entertaining nevertheless. I find it amusing how they managed to screw the encoding stuff up such badly).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org