Posted to commits@airflow.apache.org by "Jarek Potiuk (Jira)" <ji...@apache.org> on 2020/02/27 12:17:00 UTC

[jira] [Created] (AIRFLOW-6947) UTF8mb4 encoding for mysql does not work in Airflow 2.0

Jarek Potiuk created AIRFLOW-6947:
-------------------------------------

             Summary: UTF8mb4 encoding for mysql does not work in Airflow 2.0
                 Key: AIRFLOW-6947
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6947
             Project: Apache Airflow
          Issue Type: Improvement
          Components: mysql, database
    Affects Versions: 2.0.0
            Reporter: Jarek Potiuk


The problem is with how MySQL handles different encodings, especially UTF-8. MySQL's default utf8 encoding does not handle all UTF-8 characters, only those encoded in up to 3 bytes. The 4-byte ones do not work: there is an error ("Incorrect string value: '\\xF0....' for column 'description' at row 1") when you try to insert a DAG with a 4-byte Unicode character.
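To make the 3-byte versus 4-byte distinction concrete, here is a minimal Python sketch. The helper name fits_utf8mb3 is hypothetical and not part of Airflow or MySQL. It only checks how many bytes each character needs in UTF-8, which is the property that decides whether MySQL's 3-byte utf8 can store it.

```python
def fits_utf8mb3(text: str) -> bool:
    """Return True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper for illustration: 4-byte characters (such as
    most emojis) are the ones that MySQL's 3-byte utf8 cannot store.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# Show the encoded bytes and their length for a few sample characters.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(repr(ch), encoded, len(encoded))

print(fits_utf8mb3("daily report €"))   # only 1- to 3-byte characters
print(fits_utf8mb3("daily report 😀"))  # contains a 4-byte character
```

The emoji encodes to four bytes starting with 0xF0, which matches the '\\xF0....' prefix in the error message above.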

This is a problem, for example, with the DAG description, which is stored in the database. One of our customers had this very issue with their database, where the database encoding is utf8. Currently utf8 is an alias for utf8mb3 (https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html), which means it does not handle all characters (mostly emojis). In some future version of MySQL, utf8 will become an alias for utf8mb4 (https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html), which supports the full range of UTF-8-encoded characters. It is strongly advised to use utf8mb4 directly as the default encoding.

I decided to see how it works with utf8mb4 encoding and, unfortunately, it turns out that if we switch to it, the migration scripts for Airflow fail because the key size of at least one of the indexes exceeds the maximum key length:

"Specified key was too long; max key length is 3072 bytes" when the XCom primary key is created:

ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id, task_id, `key`, execution_date)

Apparently the increased size of some columns (key?) makes the index key too big for utf8mb4 (in utf8mb4 encoding, text fields are sized at 4 bytes per character).
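The arithmetic behind this can be sketched roughly as follows. The column lengths used here (250 characters for dag_id and task_id, 512 for key) are assumptions for illustration and are not stated in this report, so check the actual schema before relying on the numbers. Only the string columns are counted; execution_date would add a few more bytes on top.

```python
# Illustrative estimate only: the column lengths are assumptions,
# not values taken from this report.
MAX_KEY_BYTES = 3072  # limit quoted in the error message

ASSUMED_COLUMN_CHARS = {"dag_id": 250, "task_id": 250, "key": 512}


def string_key_bytes(bytes_per_char: int, columns: dict = ASSUMED_COLUMN_CHARS) -> int:
    """Worst-case bytes taken by the string columns of the index key."""
    return bytes_per_char * sum(columns.values())


for charset, width in (("utf8mb3", 3), ("utf8mb4", 4)):
    size = string_key_bytes(width)
    verdict = "fits" if size <= MAX_KEY_BYTES else "too long"
    print(f"{charset}: {size} bytes ({verdict})")
```

Under these assumed lengths the string columns stay just under the limit at 3 bytes per character and go well over it at 4 bytes per character, which would be consistent with the failure only appearing after the switch to utf8mb4.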

In our CI we have so far used the default MySQL encoding (which, for the uninitiated, is latin1_swedish_ci (!)). I switched it to utf8mb4 so that you can see the behaviour, and created a PR (https://github.com/apache/airflow/pull/7570) with a failed test here:

https://travis-ci.org/apache/airflow/jobs/655733996?utm_medium=notification&utm_source=github_status 

Note that a similar problem occurs in 1.10 with MySQL 5.6: if I change the charset to utf8mb4 and choose MySQL 5.6, it will fail because there the max key length was half the size (1536 characters).

There is even an issue for it in our JIRA: https://issues.apache.org/jira/browse/AIRFLOW-3786. The workaround was to use utf8 (utf8mb3) or to switch to MySQL 5.7.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)