You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/19 13:37:40 UTC

[GitHub] [airflow] potiuk opened a new pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

potiuk opened a new pull request #17729:
URL: https://github.com/apache/airflow/pull/17729


   The index size is too big in case utf8mb4 is used as encoding
   for MySQL database. We already had `sql_engine_collation_for_ids`
   configuration to allow the id fields to use different collation,
   but the user had to set it up manually in case of a failure to
   create a db and it was not obvious, not discoverable and rather
   clumsy.
   
   Since this is really only a problem with MySQL the easy solution
   is to force this parameter to utf8mb3_general_ci for all mysql
   databases. It has no negative consequences, really as all
   relevant IDs are ASCII anyway.
   
   Related: #17603
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/main/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
uranusjr commented on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903182497


   > This is how it works actually :). Effectively for MySQL (and now also mariadb) the default will be `utf8mb3_general_ci` but if you set collation parameter, it will be used instead.
   
   Oh, so I was misunderstanding the comment in the config file:
   
   > You can override this parameter to any collation you want, but on mysql it will be set always to `utf8mb3_general_ci` so that the index sizes of our index keys will not exceed maximum size of allowed index.
   
   This reads to me like Airflow would always use `utf8mb3_general_ci` on MySQL no matter what value the config is set. Perhaps this should be reworded?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903249660


   @uranusjr : 
   
   > Collation for ``dag_id``, ``task_id``, ``key`` columns in case they have different encoding.
   > By default this collation is the same as the database collation, however for ``mysql`` and ``mariadb``
   > the default is ``utf8mb3_general_ci`` so that the index sizes of our index keys will not exceed
   > the maximum size of allowed index when collation is set to utf8mb4 variant
   > (see https://github.com/apache/airflow/pull/17603#issuecomment-901121618).
   
   Fixed the comment/parameter description


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903249660


   @uranusjr : 
   
   > Collation for ``dag_id``, ``task_id``, ``key`` columns in case they have different encoding.
   > By default this collation is the same as the database collation, however for ``mysql`` and ``mariadb``
   > the default is ``utf8mb3_general_ci`` so that the index sizes of our index keys will not exceed
   > the maximum size of allowed index when collation is set to ``utf8mb4`` variant
   > (see https://github.com/apache/airflow/pull/17603#issuecomment-901121618).
   
   Fixed the comment/parameter description!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903269651


   The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] siddharthvp commented on a change in pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
siddharthvp commented on a change in pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#discussion_r693308540



##########
File path: airflow/models/base.py
##########
@@ -44,6 +44,19 @@ def get_id_collation_args():
     if collation:
         return {'collation': collation}
     else:
+        # Automatically use utf8mb3_general_ci collation for mysql
+        # This is backwards-compatible. All our IDS are ASCII anyway so even if
+        # we migrate from previously installed database with different collation and we end up mixture of
+        # COLLATIONS, it's not a problem whatsoever (and we keep it small enough so that our indexes
+        # for MYSQL will not exceed the maximum index size.
+        #
+        # See https://github.com/apache/airflow/pull/17603#issuecomment-901121618.
+        #
+        # We cannot use session/dialect as at this point we are trying to determine the right connection
+        # parameters, so we use the connection
+        conn = conf.get('core', 'sql_alchemy_conn', fallback='')
+        if conn.startswith('mysql'):

Review comment:
       ```suggestion
           if conn.startswith('mysql') or conn.startswith('mariadb'):
   ```
   
   Not that it's officially supported, but with #16907 open, doesn't hurt?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#discussion_r693323408



##########
File path: airflow/models/base.py
##########
@@ -44,6 +44,19 @@ def get_id_collation_args():
     if collation:
         return {'collation': collation}
     else:
+        # Automatically use utf8mb3_general_ci collation for mysql
+        # This is backwards-compatible. All our IDS are ASCII anyway so even if
+        # we migrate from previously installed database with different collation and we end up mixture of
+        # COLLATIONS, it's not a problem whatsoever (and we keep it small enough so that our indexes
+        # for MYSQL will not exceed the maximum index size.
+        #
+        # See https://github.com/apache/airflow/pull/17603#issuecomment-901121618.
+        #
+        # We cannot use session/dialect as at this point we are trying to determine the right connection
+        # parameters, so we use the connection
+        conn = conf.get('core', 'sql_alchemy_conn', fallback='')
+        if conn.startswith('mysql'):

Review comment:
       Would `conn.dialect.name == "mysql"` be more robust?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903183021


   > > You can override this parameter to any collation you want, but on mysql it will be set always to `utf8mb3_general_ci` so that the index sizes of our index keys will not exceed maximum size of allowed index.
   
   Did I wrote it :D ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903182974


   > I'd _like_ to allow non ascii characters in DAG ids in the future.
   
   Yep. Me too :).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903183021


   > > You can override this parameter to any collation you want, but on mysql it will be set always to `utf8mb3_general_ci` so that the index sizes of our index keys will not exceed maximum size of allowed index.
   
   Did I write it :D ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#discussion_r693337807



##########
File path: airflow/models/base.py
##########
@@ -44,6 +44,19 @@ def get_id_collation_args():
     if collation:
         return {'collation': collation}
     else:
+        # Automatically use utf8mb3_general_ci collation for mysql
+        # This is backwards-compatible. All our IDS are ASCII anyway so even if
+        # we migrate from previously installed database with different collation and we end up mixture of
+        # COLLATIONS, it's not a problem whatsoever (and we keep it small enough so that our indexes
+        # for MYSQL will not exceed the maximum index size.
+        #
+        # See https://github.com/apache/airflow/pull/17603#issuecomment-901121618.
+        #
+        # We cannot use session/dialect as at this point we are trying to determine the right connection
+        # parameters, so we use the connection
+        conn = conf.get('core', 'sql_alchemy_conn', fallback='')
+        if conn.startswith('mysql'):

Review comment:
       > Not that it's officially supported, but with #16907 open, doesn't hurt?
   
   Agree :)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903182801


   > > In this case, would `ascii_general_ci` be even better?
   > 
   > Maybe in the future we will allow non-ascii for IDs? Who knows. Utf8_mb3 is fully ascii-compatible, so I think we are ok with that.
   
   I'd _like_ to allow non ascii characters in DAG ids in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk merged pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #17729:
URL: https://github.com/apache/airflow/pull/17729


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903283048


   ough. Must have been lost at rebase. I will fix it now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#discussion_r693337138



##########
File path: docs/apache-airflow/howto/set-up-database.rst
##########
@@ -154,14 +154,18 @@ In the example below, a database ``airflow_db`` and user  with username ``airflo
 
 .. code-block:: sql
 
-   CREATE DATABASE airflow_db CHARACTER SET utf8 COLLATE utf8_general_ci;
+   CREATE DATABASE airflow_db CHARACTER SET utf8 COLLATE utf8mb4_unicode_ci;
    CREATE USER 'airflow_user' IDENTIFIED BY 'airflow_pass';
    GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow_user';
 
 
 .. note::
 
-   The database must use a UTF-8 character set
+   The database must use a UTF-8 character set. A small caveat that you must be aware of is that utf8 in newer versions of MySQL is really utf8mb4 which
+   causes Airflow indexes to grow too large (see https://github.com/apache/airflow/pull/17603#issuecomment-901121618). Therefore as of Airflow 2.2
+   all MySQL databases have ``sql_engine_collation_for_ids`` set automatically to ``utf8mb3_general_ci`` (unless you override it). This might

Review comment:
       not realy., see few lines above the change:
   
   ```
       if collation:
           return {'collation': collation}
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] siddharthvp commented on a change in pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
siddharthvp commented on a change in pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#discussion_r693308710



##########
File path: docs/apache-airflow/howto/set-up-database.rst
##########
@@ -154,14 +154,18 @@ In the example below, a database ``airflow_db`` and user  with username ``airflo
 
 .. code-block:: sql
 
-   CREATE DATABASE airflow_db CHARACTER SET utf8 COLLATE utf8_general_ci;
+   CREATE DATABASE airflow_db CHARACTER SET utf8 COLLATE utf8mb4_unicode_ci;
    CREATE USER 'airflow_user' IDENTIFIED BY 'airflow_pass';
    GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow_user';
 
 
 .. note::
 
-   The database must use a UTF-8 character set
+   The database must use a UTF-8 character set. A small caveat that you must be aware of is that utf8 in newer versions of MySQL is really utf8mb4 which
+   causes Airflow indexes to grow too large (see https://github.com/apache/airflow/pull/17603#issuecomment-901121618). Therefore as of Airflow 2.2
+   all MySQL databases have ``sql_engine_collation_for_ids`` set automatically to ``utf8mb3_general_ci`` (unless you override it). This might

Review comment:
       Your other changes suggest it can't be overridden for mysql databases.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903249660


   @uranusjr : 
   
   > Collation for ``dag_id``, ``task_id``, ``key`` columns in case they have different encoding.
   > By default this collation is the same as the database collation, however for ``mysql`` and ``mariadb``
   > the default is ``utf8mb3_general_ci`` so that the index sizes of our index keys will not exceed
   > the maximum size of allowed index when collation is set to ``utf8mb4`` variant
   > (see https://github.com/apache/airflow/pull/17603#issuecomment-901121618).
   
   Fixed the comment/parameter description### 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903096436


   > In this case, would `ascii_general_ci` be even better? 
   
   Maybe in the future we will allow non-ascii for IDs? Who knows. Utf8_mb3 is fully ascii-compatible, so I think we are ok with that.
   
   > If not, I feel we should just _default_ to `utf8mb3_general_ci` but still allow users to override it to something else. If the user uses `utf8mb4_general_ci`, that’s on them; we can document this or even explicitly prevent this user error by raising an exception.
   
   This is how it works actually :). Effectively for MySQL (and now also mariadb) the default will be `utf8mb3_general_ci`  but if you set collation parameter, it will be used instead. This whole (`if mysql`) only happens if collation is not set (see few lines above)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#discussion_r693336982



##########
File path: airflow/models/base.py
##########
@@ -44,6 +44,19 @@ def get_id_collation_args():
     if collation:
         return {'collation': collation}
     else:
+        # Automatically use utf8mb3_general_ci collation for mysql
+        # This is backwards-compatible. All our IDS are ASCII anyway so even if
+        # we migrate from previously installed database with different collation and we end up mixture of
+        # COLLATIONS, it's not a problem whatsoever (and we keep it small enough so that our indexes
+        # for MYSQL will not exceed the maximum index size.
+        #
+        # See https://github.com/apache/airflow/pull/17603#issuecomment-901121618.
+        #
+        # We cannot use session/dialect as at this point we are trying to determine the right connection
+        # parameters, so we use the connection
+        conn = conf.get('core', 'sql_alchemy_conn', fallback='')
+        if conn.startswith('mysql'):

Review comment:
       Conn. is just a string. We do not want to instantiate a session just yet at this moment, wy just want to use connection string to make a decision here!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on pull request #17729: Automatically use utf8mb3_general_ci collation for mysql

Posted by GitBox <gi...@apache.org>.
uranusjr commented on pull request #17729:
URL: https://github.com/apache/airflow/pull/17729#issuecomment-903080753


   > It has no negative consequences, really as all relevant IDs are ASCII anyway.
   
   In this case, would `ascii_general_ci` be even better? If not, I feel we should just *default* to `utf8mb3_general_ci` but still allow users to override it to something else. If the user uses `utf8mb4_general_ci`, that’s on them; we can document this or even explicitly prevent this user error by raising an exception.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org