Posted to dev@airflow.apache.org by Jarek Potiuk <Ja...@polidea.com> on 2020/03/01 19:40:59 UTC

Re: MySQL version support for Airflow 5.6 ->? 5.7

>
>
>
> does c) really help? We'd still need a unique key on the existing PK
> columns - i.e. does Mysql just complain about the length/size of the PK
> index, but not other unique indexes?
>

Right indeed - so the problem (and now I understand the reason why we changed
it between 1.10 and 2.0) is also the uniqueness of the key at insert. I imagine
it is an occasional problem in 1.10, where multiple task re-runs can insert the
same key twice (ouch!). Now I see why the change was introduced.


Unfortunately, it won't work if we try to index all 4 columns, or even 3
(without execution_date). MySQL/InnoDB has an index prefix limit
<https://dev.mysql.com/doc/refman/5.7/en/innodb-limits.html> for all
indexes, not only the primary one (and this is with default settings - 16KB
page size + large_prefix enabled). In our XCom table, the key is 512
characters, which translates to 2048 bytes in utf8mb4. Each of
dag_id/task_id is 250 characters (luckily not 256 - you will see why!).
When you combine dag_id, task_id, key and execution_date in a single index
you get 4*250 + 4*250 + 4*512 + 8 = *4056* bytes, and the max key length
for MySQL is *3072* bytes.

So we have to find another way, or simply say we don't support MySQL's
full 4-byte UTF-8 character set :(. By the way, we are lucky that dag_id and
task_id are 250 chars and not 256 - if they were 256, then even utf8mb3 would
not be supported, as the index size would be *3080* bytes.
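
Just to make the arithmetic easy to re-check, a tiny sketch (my own
illustration - the column lengths are the current model lengths, and I assume
8 bytes for execution_date as above):

    # Index width = bytes-per-character * characters per column,
    # plus ~8 bytes assumed for the execution_date column.
    INNODB_MAX_KEY_BYTES = 3072  # 16KB pages with large prefixes enabled

    def index_bytes(bytes_per_char, dag_id=250, task_id=250, key=512):
        return bytes_per_char * (dag_id + task_id + key) + 8

    print(index_bytes(4), index_bytes(4) <= INNODB_MAX_KEY_BYTES)  # 4056 False (utf8mb4)
    print(index_bytes(3), index_bytes(3) <= INNODB_MAX_KEY_BYTES)  # 3044 True  (utf8mb3)
    print(index_bytes(3, dag_id=256, task_id=256))                 # 3080 - would not fit either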


> Irrespective of mysql's quirks, we would almost never look up rows by this
> int-pk (we look up by dag+task+key, and often but not _awlays_ by
> execution_date as well), so I don't think we'd get any other benefits from
> it.
>

The benefit I spoke about is a potentially much smaller index to look up. In
the current setup the primary key could be used by MySQL to do XCom searches,
and depending on how well the optimiser works, it could actually be slower
than using a secondary index. I believe there are good reasons why index
sizes are limited: an index is read in 16KB blocks at a time, so at most 5
entries fit in a block with a 3072-byte index, and at most 8 with a 2100-byte
one. That might affect search performance - I bet that with such big indexes
we often hit the limits of the cache, and a lot of the index is read from
disk rather than kept in memory. That is a huge performance problem that you
only start to see when you have a lot of rows (in this case, a lot of XCom
values stored) - exactly the kind of problem the maximum index size in MySQL
tries to avoid.

I checked that we have these lookups for XCom (a rough sketch of one of them
follows the list):

1. (dag_ids, task_id, key, execution_date == date)
2. (dag_ids, task_id, key, execution_date <= date)
3. (dag_ids, key, execution_date == date) (no task_id - for lineage)
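
For reference, lookup 2. looks roughly like this in SQLAlchemy terms (a
sketch only - not copied from the actual XCom model code, just the shape of
the query; I assume the usual Session/XCom imports):

    from airflow.models import XCom
    from airflow.settings import Session

    def get_latest_xcom(dag_id, task_id, key, execution_date):
        # Sketch of lookup 2.: most recent value at or before execution_date.
        session = Session()
        try:
            return (
                session.query(XCom)
                .filter(
                    XCom.dag_id == dag_id,
                    XCom.task_id == task_id,
                    XCom.key == key,
                    XCom.execution_date <= execution_date,
                )
                .order_by(XCom.execution_date.desc())
                .first()
            )
        finally:
            session.close()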

I could not find the "no execution-date" lookup for XCom. Did I miss it
somehow ? Or maybe it's something that was in the past but it's not any
more?

In 1.10 (and it is still there in 2.0) we have this index:
'idx_xcom_dag_task_date', 'xcom', ['dag_id', 'task_id', 'execution_date'] -
and it was used for most of the lookups (the assumption was that the number
of keys per task was small). But in 2.0 the primary key as currently defined
can (and probably will) be used for most of the lookups of type 1. (and maybe
partially for 3., since the primary key is (key, execution_date, task_id,
dag_id)), and - depending on how the optimiser is written - with an index of
this size it might be slower than a secondary index lookup.

Surprisingly, it looks like the lineage lookup (3.) uses almost a full table
scan - it can at most filter on dag_id using the index, but then it has to
full-scan the rest. This is something to optimise regardless.

The current index definitely won't work with utf8mb4 - it fails at creation.
So we have to come up with some alternative approach for both the lookups and
the uniqueness check.

If decreasing the size of the fields is not a good idea (maybe we should
consider that!), then I think what we could do is the following (a rough
sketch follows the list):

   - hash the contents of dag_id, task_id, key (or just hash the key)
   using md5 or similar and make the primary key out of that - this way we
   can handle the uniqueness check
   - keep the old index (dag_id, task_id, execution_date)
   - add a new index on (dag_id, key, execution_date) - to support the
   lineage lookup = *3056 bytes* - just below the limit.
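
Very rough sketch of the hashed-PK idea (made-up names, not a migration
proposal - one way to read it is hashing dag_id/task_id/key and combining the
hash with execution_date in the primary key):

    from hashlib import md5

    from sqlalchemy import Column, DateTime, Index, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class XComSketch(Base):
        # Not the real model - just illustrating the idea.
        __tablename__ = 'xcom_sketch'
        # 32-char hex md5 of dag_id + task_id + key: keeps the PK index tiny
        # and gives us the uniqueness check (together with execution_date).
        content_hash = Column(String(32), primary_key=True)
        execution_date = Column(DateTime, primary_key=True)
        dag_id = Column(String(250), nullable=False)
        task_id = Column(String(250), nullable=False)
        key = Column(String(512), nullable=False)
        __table_args__ = (
            # keep the old lookup index
            Index('idx_xcom_dag_task_date', 'dag_id', 'task_id', 'execution_date'),
            # new index for the lineage lookup - ~3056 bytes with utf8mb4
            Index('idx_xcom_dag_key_date', 'dag_id', 'key', 'execution_date'),
        )

        @staticmethod
        def compute_hash(dag_id, task_id, key):
            raw = '\x01'.join([dag_id, task_id, key])
            return md5(raw.encode('utf-8')).hexdigest()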

Thoughts?

 J.


-ash
> On Feb 27 2020, at 1:25 pm, Jarek Potiuk <Ja...@polidea.com> wrote:
> > I think the main issue here is not performance but to make it works and
> be
> > compatible with 1.10. .
> >
> > And indeed if we go back for I'd index we would have to check the impact.
> > Was insert performance the original reason why it has changed ? Anyone
> > knows ?
> >
> > I am not sure what it is going to be but taking into account that those
> > inserts in xcom table are always individual insert (not bulks) and there
> is
> > likely more reads than writes - it might be either way easily.
> >
> > J..
> > czw., 27 lut 2020, 13:55 użytkownik Kamil Breguła <
> kamil.bregula@polidea.com>
> > napisał:
> >
> > > Hello,
> > > Creating a sequence/AUTOINCREMENT will have a very large impact on the
> > > speed of adding rows.
> > > More information:
> https://docs.sqlalchemy.org/en/13/faq/performance.html
> > > In the near future I wanted to speed up the saving of items by using a
> > > low-level API. This will not work if we different PK.
> > >
> > > Best regards,
> > > Kamil
> > >
> > > On Thu, Feb 27, 2020 at 1:50 PM Jarek Potiuk <Jarek.Potiuk@polidea.com
> >
> > > wrote:
> > > >
> > > > I am testing now 5.7 in 1.10 branch with the PR here:
> > > > https://github.com/apache/airflow/pull/7569 and it seems it's an
> easy
> > > > one one - seems it will "just work".
> > > >
> > > > However while testing it (I analysed a customer problem) I have found
> > > > that we have an issue in 2.0 with MySQL (not only 5.7 - it's the same
> > > > for 5.6 and even 8.0 (!). It boils down to an index (at least one) is
> > > > to big for the recommended uf8mb4 encoding:
> > > >
> > > > ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id, task_id,
> > > > `key`, execution_date).
> > > >
> > > > The utf8mb4 encoding is needed to handle 4-byte UTF characters
> (mostly
> > > > emojis but also some Chinese characters).
> > > >
> > > > The question is what do we do with that. I think we have two options.
> > > > Not sure what was the story behind increasing the lengths (apparently
> > > > the problem does not
> > > >
> > > > a) decrease sizes of some of the fields (that would make backwards
> > > > incompatibility)
> > > > b) partial indexes of the fields (not really - this is primary key so
> > > > I guess it's not a good idea, although MYSQL supports indexing only
> > > > part of the string - we could cut key uniqueness at 200 chars for
> > > > example - but again, backwards incompatibility is a problem)
> > > > c) revert to id - generated primary kay and make secondary index
> > > > without the key - there are usually not many xcoms per task, so I
> > > > would not expect any problems - actually I would expect this will
> > > > speed up searches because the index will be much, much smaller (50%
> > > > more or less). I think this is the most viable option.
> > > >
> > > > Any other ideas?
> > > > I described all details in this issue:
> > > > https://issues.apache.org/jira/browse/AIRFLOW-6947
> > > >
> > > > Here is similar story of somone who had similar issue
> > > > https://serversforhackers.com/c/mysql-utf8-and-indexing
> > > >
> > > > J.
> > > >
> > > > Jarek Potiuk
> > > > Polidea | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129



-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: MySQL version support for Airflow 5.6 ->? 5.7

Posted by Jarek Potiuk <Ja...@polidea.com>.
FYI. While testing 5.7 I have found
https://issues.apache.org/jira/browse/AIRFLOW-7001 - fix is in
https://github.com/apache/airflow/pull/7641

It looks like Airflow could not work with MySQL 5.7 because it tried to
insert time in the '2020-03-07 07:32:34.121705+00:00' format into a TIMESTAMP
field. With 5.6 it was OK because the timezone was IGNORED (!) (luckily we
always inserted it with the UTC timezone, I believe). With MySQL 5.7 it
started to fail with "invalid datetime format" 😱.

J.


On Mon, Mar 2, 2020 at 8:48 AM Jarek Potiuk <Ja...@polidea.com>
wrote:

> Or actually we can force the charset to utf8mb3 to keep the potential to
> use most utf8 characters in the ids/keys I will test it shortly but it
> should work.
>
> On Mon, Mar 2, 2020 at 8:38 AM Jarek Potiuk <Ja...@polidea.com>
> wrote:
>
>> Another option that came to me - and possibly that's the simplest and
>> best one.
>>
>> We can force charset for only the _id + key columns to be latin1. This
>> might solve all the problems at once with pretty much zero impact on the
>> rest of the system. It seems super-simple.
>>
>>
>> https://stackoverflow.com/questions/18561190/enforce-column-encoding-with-sqlalchemy
>>
>>
>>
>> J.
>>
>>
>> On Sun, Mar 1, 2020 at 9:47 PM Jarek Potiuk <Ja...@polidea.com>
>> wrote:
>>
>>>
>>>> Shouldn't be a problem in practice as we delete XComs for a task before
>>>> running out again. Might happen in very rare edge casesi guess.
>>>>
>>>
>>>
>>>> Experience has taught me that if you want the mysql optimiser to do
>>>> something sensible: it won't. This may be better in newer versions, but was
>>>> still a problem in 5.5.
>>>>
>>>> Smaller index lookups from where/what? All queries from dags etc would
>>>> not use `WHERE id = ?` as you detail below.
>>>>
>>> In 2.0 there is the "huge" PK (dag_id, task_id, execution_date, key) and
>>> "biggish" secondary index (dag_id, task_id, execution_date). When you
>>> select by all four: ( dag_id, task_id, execution_date, key) - the
>>> optimizer might choose either of the two (depends on how many rows are in
>>> the table, what is the distribution of keys etc.). And my bet is that
>>> having both indexes now might "confuse" the optimiser to always choose
>>> primary key, which *Might* be slower due to its size (but this is pure
>>> speculation seeing similar things in the past). It's just a side comment -
>>> it does not have to be slower and even if it is, it will be likely
>>> negligible overall.
>>>
>>>
>>>> 512 is probably longer than it needs to be, 250 could be long enough,
>>>> or perhaps we have slightly different behaviour (column length) on mysql
>>>> and other databases?
>>>>
>>>
>>> column length is always in characters - but it translates to more bytes
>>> for utf8mb4.... too long.
>>>
>>> Rough calculation says that we should have to decrease key column to
>>> 266. And maybe that's the easiest solution ? Maybe we should simply
>>> decrease the key column size to 266 when we have MySQL and UTF8MB4 as
>>> encoding???? I think this might be a good compromise between backwards
>>> compatibility and not limiting other installations that do not have that
>>> con. I think this will mean that some DAGs might work on Postgres but not
>>> on MySQL with UTF8mb4. But it's at least something that user can deal with
>>> and fix the DAG.
>>>
>>>
>>> Another possibility would be to give a TI an integer PK, and then make
>>>> XCom (ti_id, key) (fk not required).
>>>>
>>>> I guess the hash approach has precedent in Airflow, but I'm not a huge
>>>> fan of possible collisions, however rare (and would like to work towards
>>>> removing it from the current tables.)
>>>>
>>>
>>> Yeah I do not like it either - too much. If we can live with shorter key
>>> size for MySQL - I think that might be the best solution...
>>>
>>>
>>>>
>>>>
>>>> --
>>>
>>> Jarek Potiuk
>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>
>>> M: +48 660 796 129 <+48660796129>
>>> [image: Polidea] <https://www.polidea.com/>
>>>
>>>
>>
>> --
>>
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>
>> M: +48 660 796 129 <+48660796129>
>> [image: Polidea] <https://www.polidea.com/>
>>
>>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: MySQL version support for Airflow 5.6 ->? 5.7

Posted by Jarek Potiuk <Ja...@polidea.com>.
Or actually we can force the charset to utf8mb3 to keep the ability to use
most UTF-8 characters in the ids/keys. I will test it shortly, but it should
work.
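
On the SQLAlchemy side that should be roughly a matter of table-level MySQL
options (a sketch, assuming we set it per table; 'utf8' is utf8mb3 in MySQL
5.x):

    from sqlalchemy import Column, DateTime, MetaData, String, Table

    metadata = MetaData()

    xcom_sketch = Table(
        'xcom_sketch', metadata,
        Column('dag_id', String(250), primary_key=True),
        Column('task_id', String(250), primary_key=True),
        Column('key', String(512), primary_key=True),
        Column('execution_date', DateTime, primary_key=True),
        # 3 bytes per character keeps the composite PK at ~3044 bytes,
        # just under the 3072-byte InnoDB limit.
        mysql_charset='utf8',
        mysql_collate='utf8_general_ci',
    )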

On Mon, Mar 2, 2020 at 8:38 AM Jarek Potiuk <Ja...@polidea.com>
wrote:

> Another option that came to me - and possibly that's the simplest and best
> one.
>
> We can force charset for only the _id + key columns to be latin1. This
> might solve all the problems at once with pretty much zero impact on the
> rest of the system. It seems super-simple.
>
>
> https://stackoverflow.com/questions/18561190/enforce-column-encoding-with-sqlalchemy
>
>
>
> J.
>
>
> On Sun, Mar 1, 2020 at 9:47 PM Jarek Potiuk <Ja...@polidea.com>
> wrote:
>
>>
>>> Shouldn't be a problem in practice as we delete XComs for a task before
>>> running out again. Might happen in very rare edge casesi guess.
>>>
>>
>>
>>> Experience has taught me that if you want the mysql optimiser to do
>>> something sensible: it won't. This may be better in newer versions, but was
>>> still a problem in 5.5.
>>>
>>> Smaller index lookups from where/what? All queries from dags etc would
>>> not use `WHERE id = ?` as you detail below.
>>>
>> In 2.0 there is the "huge" PK (dag_id, task_id, execution_date, key) and
>> "biggish" secondary index (dag_id, task_id, execution_date). When you
>> select by all four: ( dag_id, task_id, execution_date, key) - the
>> optimizer might choose either of the two (depends on how many rows are in
>> the table, what is the distribution of keys etc.). And my bet is that
>> having both indexes now might "confuse" the optimiser to always choose
>> primary key, which *Might* be slower due to its size (but this is pure
>> speculation seeing similar things in the past). It's just a side comment -
>> it does not have to be slower and even if it is, it will be likely
>> negligible overall.
>>
>>
>>> 512 is probably longer than it needs to be, 250 could be long enough, or
>>> perhaps we have slightly different behaviour (column length) on mysql and
>>> other databases?
>>>
>>
>> column length is always in characters - but it translates to more bytes
>> for utf8mb4.... too long.
>>
>> Rough calculation says that we should have to decrease key column to 266.
>> And maybe that's the easiest solution ? Maybe we should simply decrease the
>> key column size to 266 when we have MySQL and UTF8MB4 as encoding???? I
>> think this might be a good compromise between backwards compatibility and
>> not limiting other installations that do not have that con. I think this
>> will mean that some DAGs might work on Postgres but not on MySQL with
>> UTF8mb4. But it's at least something that user can deal with and fix the
>> DAG.
>>
>>
>> Another possibility would be to give a TI an integer PK, and then make
>>> XCom (ti_id, key) (fk not required).
>>>
>>> I guess the hash approach has precedent in Airflow, but I'm not a huge
>>> fan of possible collisions, however rare (and would like to work towards
>>> removing it from the current tables.)
>>>
>>
>> Yeah I do not like it either - too much. If we can live with shorter key
>> size for MySQL - I think that might be the best solution...
>>
>>
>>>
>>>
>>> --
>>
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>
>> M: +48 660 796 129 <+48660796129>
>> [image: Polidea] <https://www.polidea.com/>
>>
>>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: MySQL version support for Airflow 5.6 ->? 5.7

Posted by Jarek Potiuk <Ja...@polidea.com>.
Another option that came to me - and possibly that's the simplest and best
one.

We can force the charset of just the *_id and key columns to be latin1. This
might solve all the problems at once, with pretty much zero impact on the
rest of the system. It seems super-simple.

https://stackoverflow.com/questions/18561190/enforce-column-encoding-with-sqlalchemy
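
In SQLAlchemy terms that would be roughly the per-column variant below (a
sketch - the variant is only applied on the MySQL dialect, so other databases
keep the plain String):

    from sqlalchemy import Column, String
    from sqlalchemy.dialects import mysql

    # dag_id/task_id/key stored as 1-byte latin1 on MySQL only,
    # so the composite index fits within the 3072-byte limit.
    dag_id = Column(
        'dag_id',
        String(250).with_variant(mysql.VARCHAR(250, charset='latin1'), 'mysql'),
        nullable=False,
    )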



J.


On Sun, Mar 1, 2020 at 9:47 PM Jarek Potiuk <Ja...@polidea.com>
wrote:

>
>> Shouldn't be a problem in practice as we delete XComs for a task before
>> running out again. Might happen in very rare edge casesi guess.
>>
>
>
>> Experience has taught me that if you want the mysql optimiser to do
>> something sensible: it won't. This may be better in newer versions, but was
>> still a problem in 5.5.
>>
>> Smaller index lookups from where/what? All queries from dags etc would
>> not use `WHERE id = ?` as you detail below.
>>
> In 2.0 there is the "huge" PK (dag_id, task_id, execution_date, key) and
> "biggish" secondary index (dag_id, task_id, execution_date). When you
> select by all four: ( dag_id, task_id, execution_date, key) - the
> optimizer might choose either of the two (depends on how many rows are in
> the table, what is the distribution of keys etc.). And my bet is that
> having both indexes now might "confuse" the optimiser to always choose
> primary key, which *Might* be slower due to its size (but this is pure
> speculation seeing similar things in the past). It's just a side comment -
> it does not have to be slower and even if it is, it will be likely
> negligible overall.
>
>
>> 512 is probably longer than it needs to be, 250 could be long enough, or
>> perhaps we have slightly different behaviour (column length) on mysql and
>> other databases?
>>
>
> column length is always in characters - but it translates to more bytes
> for utf8mb4.... too long.
>
> Rough calculation says that we should have to decrease key column to 266.
> And maybe that's the easiest solution ? Maybe we should simply decrease the
> key column size to 266 when we have MySQL and UTF8MB4 as encoding???? I
> think this might be a good compromise between backwards compatibility and
> not limiting other installations that do not have that con. I think this
> will mean that some DAGs might work on Postgres but not on MySQL with
> UTF8mb4. But it's at least something that user can deal with and fix the
> DAG.
>
>
> Another possibility would be to give a TI an integer PK, and then make
>> XCom (ti_id, key) (fk not required).
>>
>> I guess the hash approach has precedent in Airflow, but I'm not a huge
>> fan of possible collisions, however rare (and would like to work towards
>> removing it from the current tables.)
>>
>
> Yeah I do not like it either - too much. If we can live with shorter key
> size for MySQL - I think that might be the best solution...
>
>
>>
>>
>> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: MySQL version support for Airflow 5.6 ->? 5.7

Posted by Jarek Potiuk <Ja...@polidea.com>.
>
>
> Shouldn't be a problem in practice as we delete XComs for a task before
> running out again. Might happen in very rare edge casesi guess.
>


> Experience has taught me that if you want the mysql optimiser to do
> something sensible: it won't. This may be better in newer versions, but was
> still a problem in 5.5.
>
> Smaller index lookups from where/what? All queries from dags etc would not
> use `WHERE id = ?` as you detail below.
>
In 2.0 there is the "huge" PK (dag_id, task_id, execution_date, key) and
"biggish" secondary index (dag_id, task_id, execution_date). When you
select by all four: ( dag_id, task_id, execution_date, key) - the
optimizer might choose either of the two (depends on how many rows are in
the table, what is the distribution of keys etc.). And my bet is that
having both indexes now might "confuse" the optimiser to always choose
primary key, which *Might* be slower due to its size (but this is pure
speculation seeing similar things in the past). It's just a side comment -
it does not have to be slower and even if it is, it will be likely
negligible overall.


> 512 is probably longer than it needs to be, 250 could be long enough, or
> perhaps we have slightly different behaviour (column length) on mysql and
> other databases?
>

Column length is always in characters - but it translates to more bytes with
utf8mb4... too long.

A rough calculation says that we would have to decrease the key column to
266. And maybe that's the easiest solution? Maybe we should simply decrease
the key column size to 266 when we have MySQL with utf8mb4 as the encoding?
I think this might be a good compromise between backwards compatibility and
not limiting other installations that do not have that constraint. It will
mean that some DAGs might work on Postgres but not on MySQL with utf8mb4,
but it's at least something the user can deal with and fix in the DAG.
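
The 266 comes from what is left of the 3072-byte budget after dag_id, task_id
and execution_date (quick check, again assuming 8 bytes for the date):

    remaining = 3072 - 4 * 250 - 4 * 250 - 8   # 1064 bytes left for key
    max_key_chars = remaining // 4             # 266 characters in utf8mb4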


Another possibility would be to give a TI an integer PK, and then make XCom
> (ti_id, key) (fk not required).
>
> I guess the hash approach has precedent in Airflow, but I'm not a huge fan
> of possible collisions, however rare (and would like to work towards
> removing it from the current tables.)
>

Yeah, I do not like it much either. If we can live with a shorter key size
for MySQL, I think that might be the best solution...


>
>
> --

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: MySQL version support for Airflow 5.6 ->? 5.7

Posted by Ash Berlin-Taylor <as...@apache.org>.
Oh the ti_id idea I mentioned below won't work very well for the case when we query by non-exact execution date

On 1 March 2020 20:06:08 GMT, Ash Berlin-Taylor <as...@apache.org> wrote:
>
>
>On 1 March 2020 19:40:59 GMT, Jarek Potiuk <Ja...@polidea.com>
>wrote:
>>>
>>>
>>>
>>> does c) really help? We'd still need a unique key on the existing PK
>>> columns - i.e. does Mysql just complain about the length/size of the
>>PK
>>> index, but not other unique indexes?
>>>
>>
>>Right indeed - so  the problem (and I understand the reason why we
>>changed
>>1.10 -> 2.0 ) is also the uniqueness of key at insert. I imagine it is
>>an
>>occasional problem in 1.10 where multiple tasks re-run can insert the
>>same
>>key twice (ouch!).
>
>Shouldn't be a problem in practice as we delete XComs for a task before
>running out again. Might happen in very rare edge casesi guess.
>
>> Now I see why the change was introduced.
>>
>>
>>Unfortunately - It won't work if we try to index all 4 columns. Or
>even
>>3
>>(without execution date).  MySQL/InnoDB has index prefix limit
>><https://dev.mysql.com/doc/refman/5.7/en/innodb-limits.html> for all
>>indexes not only primary (and this is with default settings- 8KB page
>>size
>>+ large_prefix enabled).  In our XCom table, the Key is 512 characters
>>which translates to 2048 bytes in case of utf8mb4. Each of the
>>dag_id/task_id is 250 characters (luckily not 256 - you will see
>why!).
>>When you combine dag_id, task_id, key and execution date in single
>>index
>>you get 4*250 + 4*250 + 4* 512 + 8 = *4056* bytes. And the max key
>>length
>>for MySQL is *3072* bytes.
>>
>>So we have to find another way or simply say we don't support mysql's
>>full-UTF8-character set. :( . By the way - we are lucky that dag_id
>and
>>task_id are 250 chars and not 256 - if we were 256, then even utf8mb3
>>would
>>not be supported as the index size would be *3080* bytes).
>>
>>
>>> Irrespective of mysql's quirks, we would almost never look up rows
>by
>>this
>>> int-pk (we look up by dag+task+key, and often but not _awlays_ by
>>> execution_date as well), so I don't think we'd get any other
>benefits
>>from
>>> it.
>>>
>>
>>The benefit I spoke about is potentially much smaller index to
>look-up.
>>In
>>the current setup the primary key could be used by MSQL to do XCom
>>searches
>>and depending how well the optimiser works, it could actually be
>slower
>>than using secondary index. I believe there are some good reasons why
>>the
>>size of index are limited (index is read in 16KB blocks at a time so
>>you
>>have max 5 entries for index loaded at a time with 3072 size index and
>>max
>>8 with 2100 one. That might affect performance of the search - I bet
>>with
>>such big indexes we often hit the limits of cache an a lot of the
>index
>>is
>>read from the disk rather than kept in memory which is a huge
>>performance
>>problem that you start to see only when you have a lot of rows (in
>this
>>case - a lot of XCom values stored) - this is a big problem the
>maximum
>>size of index in MySQL tries to avoid.
>
>Experience has taught me that if you want the mysql optimiser to do
>something sensible: it won't. This may be better in newer versions, but
>was still a problem in 5.5.
>
>Smaller index lookups from where/what? All queries from dags etc would
>not use `WHERE id = ?` as you detail below.
>
>>I checked that we have those lookups for Xcom:
>>
>>1. ( dag_ids, task_id, key, execution_date ==  date)
>>2. ( dag_ids, task_id, key, execution_date <=  date)
>>3. ( dag_ids, key, execution_date ==  date) (no task_id  - for
>lineage)
>>
>>I could not find the "no execution-date" lookup for XCom. Did I miss
>it
>>somehow ? Or maybe it's something that was in the past but it's not
>any
>>more?
>
>Ah no, it's probably the <= case I was thinking of (where you can
>select XCom for past dates too.)
>
>>
>>In 1.10 (and it is still in 2.0) we have this index :
>>'idx_xcom_dag_task_date', 'xcom', ['dag_id', 'task_id',
>>'execution_date'] -
>>and it was used for most of the lookups (assumption was that key
>number
>>was
>>small). But in 2.0 the primary key as defined currently can (and
>>probably
>>will) be used for most of the lookups 1. (maybe partially for 3 as the
>>primary key is (key, execution_date, task_id, dag_id)) and - depending
>>how
>>the optimiser is written - with this size of the index it might be
>>slower
>>than secondary index lookup.
>>
>>Surprisingly, it looks like the lineage lookup (3.) uses almost full
>>table
>>scan - it can at most filter the dag_ids on the index but then it has
>>to
>>full-scan the rest. This is something to optimise regardless.
>>
>>The current index won't work for sure with utf8mb4 - it fails at
>>creation.
>>So we have to come up with some alternative approach for both -
>lookups
>>and
>>unique check.
>
>512 is probably longer than it needs to be, 250 could be long enough,
>or perhaps we have slightly different behaviour (column length) on
>mysql and other databases?
>>
>>If decreasing the size of the fields is not a good idea (maybe we
>>should
>>consider that!), then I think what we could do:
>>
>>   - hash the content of dag_id, task_id, key (or just hash the key) =
>>using md5 or similar and make primary key out of that - this way we
>can
>>   handle uniqueness check
>>   - keep the old index (dag_id, task_id, execution_date)
>>   - add new index for dag_id, key, execution_date - to support
>lineage
>>   lookup (dag_id, key, execution_date) = *3056 bytes* - just below
>the
>>   limit.
>>
>>Thoughts?
>
>Another possibility would be to give a TI an integer PK, and then make
>XCom (ti_id, key) (fk not required).
>
>I guess the hash approach has precedent in Airflow, but I'm not a huge
>fan of possible collisions, however rare (and would like to work
>towards removing it from the current tables.)
>
>Ash
>>
>> J.
>>
>>
>>-ash
>>> On Feb 27 2020, at 1:25 pm, Jarek Potiuk <Ja...@polidea.com>
>>wrote:
>>> > I think the main issue here is not performance but to make it
>works
>>and
>>> be
>>> > compatible with 1.10. .
>>> >
>>> > And indeed if we go back for I'd index we would have to check the
>>impact.
>>> > Was insert performance the original reason why it has changed ?
>>Anyone
>>> > knows ?
>>> >
>>> > I am not sure what it is going to be but taking into account that
>>those
>>> > inserts in xcom table are always individual insert (not bulks) and
>>there
>>> is
>>> > likely more reads than writes - it might be either way easily.
>>> >
>>> > J..
>>> > czw., 27 lut 2020, 13:55 użytkownik Kamil Breguła <
>>> kamil.bregula@polidea.com>
>>> > napisał:
>>> >
>>> > > Hello,
>>> > > Creating a sequence/AUTOINCREMENT will have a very large impact
>>on the
>>> > > speed of adding rows.
>>> > > More information:
>>> https://docs.sqlalchemy.org/en/13/faq/performance.html
>>> > > In the near future I wanted to speed up the saving of items by
>>using a
>>> > > low-level API. This will not work if we different PK.
>>> > >
>>> > > Best regards,
>>> > > Kamil
>>> > >
>>> > > On Thu, Feb 27, 2020 at 1:50 PM Jarek Potiuk
>><Jarek.Potiuk@polidea.com
>>> >
>>> > > wrote:
>>> > > >
>>> > > > I am testing now 5.7 in 1.10 branch with the PR here:
>>> > > > https://github.com/apache/airflow/pull/7569 and it seems it's
>>an
>>> easy
>>> > > > one one - seems it will "just work".
>>> > > >
>>> > > > However while testing it (I analysed a customer problem) I
>have
>>found
>>> > > > that we have an issue in 2.0 with MySQL (not only 5.7 - it's
>>the same
>>> > > > for 5.6 and even 8.0 (!). It boils down to an index (at least
>>one) is
>>> > > > to big for the recommended uf8mb4 encoding:
>>> > > >
>>> > > > ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id,
>>task_id,
>>> > > > `key`, execution_date).
>>> > > >
>>> > > > The utf8mb4 encoding is needed to handle 4-byte UTF characters
>>> (mostly
>>> > > > emojis but also some Chinese characters).
>>> > > >
>>> > > > The question is what do we do with that. I think we have two
>>options.
>>> > > > Not sure what was the story behind increasing the lengths
>>(apparently
>>> > > > the problem does not
>>> > > >
>>> > > > a) decrease sizes of some of the fields (that would make
>>backwards
>>> > > > incompatibility)
>>> > > > b) partial indexes of the fields (not really - this is primary
>>key so
>>> > > > I guess it's not a good idea, although MYSQL supports indexing
>>only
>>> > > > part of the string - we could cut key uniqueness at 200 chars
>>for
>>> > > > example - but again, backwards incompatibility is a problem)
>>> > > > c) revert to id - generated primary kay and make secondary
>>index
>>> > > > without the key - there are usually not many xcoms per task,
>so
>>I
>>> > > > would not expect any problems - actually I would expect this
>>will
>>> > > > speed up searches because the index will be much, much smaller
>>(50%
>>> > > > more or less). I think this is the most viable option.
>>> > > >
>>> > > > Any other ideas?
>>> > > > I described all details in this issue:
>>> > > > https://issues.apache.org/jira/browse/AIRFLOW-6947
>>> > > >
>>> > > > Here is similar story of somone who had similar issue
>>> > > > https://serversforhackers.com/c/mysql-utf8-and-indexing
>>> > > >
>>> > > > J.
>>> > > >
>>> > > > Jarek Potiuk
>>> > > > Polidea | Principal Software Engineer
>>> > > >
>>> > > > M: +48 660 796 129

Re: MySQL version support for Airflow 5.6 ->? 5.7

Posted by Ash Berlin-Taylor <as...@apache.org>.

On 1 March 2020 19:40:59 GMT, Jarek Potiuk <Ja...@polidea.com> wrote:
>>
>>
>>
>> does c) really help? We'd still need a unique key on the existing PK
>> columns - i.e. does Mysql just complain about the length/size of the
>PK
>> index, but not other unique indexes?
>>
>
>Right indeed - so  the problem (and I understand the reason why we
>changed
>1.10 -> 2.0 ) is also the uniqueness of key at insert. I imagine it is
>an
>occasional problem in 1.10 where multiple tasks re-run can insert the
>same
>key twice (ouch!).

Shouldn't be a problem in practice as we delete XComs for a task before running it again. Might happen in very rare edge cases I guess.

> Now I see why the change was introduced.
>
>
>Unfortunately - It won't work if we try to index all 4 columns. Or even
>3
>(without execution date).  MySQL/InnoDB has index prefix limit
><https://dev.mysql.com/doc/refman/5.7/en/innodb-limits.html> for all
>indexes not only primary (and this is with default settings- 8KB page
>size
>+ large_prefix enabled).  In our XCom table, the Key is 512 characters
>which translates to 2048 bytes in case of utf8mb4. Each of the
>dag_id/task_id is 250 characters (luckily not 256 - you will see why!).
>When you combine dag_id, task_id, key and execution date in single
>index
>you get 4*250 + 4*250 + 4* 512 + 8 = *4056* bytes. And the max key
>length
>for MySQL is *3072* bytes.
>
>So we have to find another way or simply say we don't support mysql's
>full-UTF8-character set. :( . By the way - we are lucky that dag_id and
>task_id are 250 chars and not 256 - if we were 256, then even utf8mb3
>would
>not be supported as the index size would be *3080* bytes).
>
>
>> Irrespective of mysql's quirks, we would almost never look up rows by
>this
>> int-pk (we look up by dag+task+key, and often but not _awlays_ by
>> execution_date as well), so I don't think we'd get any other benefits
>from
>> it.
>>
>
>The benefit I spoke about is potentially much smaller index to look-up.
>In
>the current setup the primary key could be used by MSQL to do XCom
>searches
>and depending how well the optimiser works, it could actually be slower
>than using secondary index. I believe there are some good reasons why
>the
>size of index are limited (index is read in 16KB blocks at a time so
>you
>have max 5 entries for index loaded at a time with 3072 size index and
>max
>8 with 2100 one. That might affect performance of the search - I bet
>with
>such big indexes we often hit the limits of cache an a lot of the index
>is
>read from the disk rather than kept in memory which is a huge
>performance
>problem that you start to see only when you have a lot of rows (in this
>case - a lot of XCom values stored) - this is a big problem the maximum
>size of index in MySQL tries to avoid.

Experience has taught me that if you want the mysql optimiser to do something sensible: it won't. This may be better in newer versions, but was still a problem in 5.5.

Smaller index lookups from where/what? All queries from dags etc would not use `WHERE id = ?` as you detail below.

>I checked that we have those lookups for Xcom:
>
>1. ( dag_ids, task_id, key, execution_date ==  date)
>2. ( dag_ids, task_id, key, execution_date <=  date)
>3. ( dag_ids, key, execution_date ==  date) (no task_id  - for lineage)
>
>I could not find the "no execution-date" lookup for XCom. Did I miss it
>somehow ? Or maybe it's something that was in the past but it's not any
>more?

Ah no, it's probably the <= case I was thinking of (where you can select XCom for past dates too.)

>
>In 1.10 (and it is still in 2.0) we have this index :
>'idx_xcom_dag_task_date', 'xcom', ['dag_id', 'task_id',
>'execution_date'] -
>and it was used for most of the lookups (assumption was that key number
>was
>small). But in 2.0 the primary key as defined currently can (and
>probably
>will) be used for most of the lookups 1. (maybe partially for 3 as the
>primary key is (key, execution_date, task_id, dag_id)) and - depending
>how
>the optimiser is written - with this size of the index it might be
>slower
>than secondary index lookup.
>
>Surprisingly, it looks like the lineage lookup (3.) uses almost full
>table
>scan - it can at most filter the dag_ids on the index but then it has
>to
>full-scan the rest. This is something to optimise regardless.
>
>The current index won't work for sure with utf8mb4 - it fails at
>creation.
>So we have to come up with some alternative approach for both - lookups
>and
>unique check.

512 is probably longer than it needs to be, 250 could be long enough, or perhaps we have slightly different behaviour (column length) on mysql and other databases?
>
>If decreasing the size of the fields is not a good idea (maybe we
>should
>consider that!), then I think what we could do:
>
>   - hash the content of dag_id, task_id, key (or just hash the key) =
>using md5 or similar and make primary key out of that - this way we can
>   handle uniqueness check
>   - keep the old index (dag_id, task_id, execution_date)
>   - add new index for dag_id, key, execution_date - to support lineage
>   lookup (dag_id, key, execution_date) = *3056 bytes* - just below the
>   limit.
>
>Thoughts?

Another possibility would be to give a TI an integer PK, and then make XCom (ti_id, key) (fk not required).

I guess the hash approach has precedent in Airflow, but I'm not a huge fan of possible collisions, however rare (and would like to work towards removing it from the current tables.)
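
Roughly what I have in mind, sketched with made-up names (not a concrete
migration, just the shape of it):

    from sqlalchemy import Column, DateTime, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class TaskInstanceSketch(Base):
        # TI gets a small surrogate integer PK.
        __tablename__ = 'task_instance_sketch'
        id = Column(Integer, primary_key=True, autoincrement=True)
        dag_id = Column(String(250), nullable=False)
        task_id = Column(String(250), nullable=False)
        execution_date = Column(DateTime, nullable=False)

    class XComByTI(Base):
        # XCom keyed by (ti_id, key): 4 bytes + 4 * 512 bytes,
        # comfortably under the 3072-byte index limit.
        __tablename__ = 'xcom_by_ti'
        ti_id = Column(Integer, primary_key=True)   # no FK required
        key = Column(String(512), primary_key=True)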

Ash
>
> J.
>
>
>-ash
>> On Feb 27 2020, at 1:25 pm, Jarek Potiuk <Ja...@polidea.com>
>wrote:
>> > I think the main issue here is not performance but to make it works
>and
>> be
>> > compatible with 1.10. .
>> >
>> > And indeed if we go back for I'd index we would have to check the
>impact.
>> > Was insert performance the original reason why it has changed ?
>Anyone
>> > knows ?
>> >
>> > I am not sure what it is going to be but taking into account that
>those
>> > inserts in xcom table are always individual insert (not bulks) and
>there
>> is
>> > likely more reads than writes - it might be either way easily.
>> >
>> > J..
>> > czw., 27 lut 2020, 13:55 użytkownik Kamil Breguła <
>> kamil.bregula@polidea.com>
>> > napisał:
>> >
>> > > Hello,
>> > > Creating a sequence/AUTOINCREMENT will have a very large impact
>on the
>> > > speed of adding rows.
>> > > More information:
>> https://docs.sqlalchemy.org/en/13/faq/performance.html
>> > > In the near future I wanted to speed up the saving of items by
>using a
>> > > low-level API. This will not work if we different PK.
>> > >
>> > > Best regards,
>> > > Kamil
>> > >
>> > > On Thu, Feb 27, 2020 at 1:50 PM Jarek Potiuk
><Jarek.Potiuk@polidea.com
>> >
>> > > wrote:
>> > > >
>> > > > I am testing now 5.7 in 1.10 branch with the PR here:
>> > > > https://github.com/apache/airflow/pull/7569 and it seems it's
>an
>> easy
>> > > > one one - seems it will "just work".
>> > > >
>> > > > However while testing it (I analysed a customer problem) I have
>found
>> > > > that we have an issue in 2.0 with MySQL (not only 5.7 - it's
>the same
>> > > > for 5.6 and even 8.0 (!). It boils down to an index (at least
>one) is
>> > > > to big for the recommended uf8mb4 encoding:
>> > > >
>> > > > ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id,
>task_id,
>> > > > `key`, execution_date).
>> > > >
>> > > > The utf8mb4 encoding is needed to handle 4-byte UTF characters
>> (mostly
>> > > > emojis but also some Chinese characters).
>> > > >
>> > > > The question is what do we do with that. I think we have two
>options.
>> > > > Not sure what was the story behind increasing the lengths
>(apparently
>> > > > the problem does not
>> > > >
>> > > > a) decrease sizes of some of the fields (that would make
>backwards
>> > > > incompatibility)
>> > > > b) partial indexes of the fields (not really - this is primary
>key so
>> > > > I guess it's not a good idea, although MYSQL supports indexing
>only
>> > > > part of the string - we could cut key uniqueness at 200 chars
>for
>> > > > example - but again, backwards incompatibility is a problem)
>> > > > c) revert to id - generated primary kay and make secondary
>index
>> > > > without the key - there are usually not many xcoms per task, so
>I
>> > > > would not expect any problems - actually I would expect this
>will
>> > > > speed up searches because the index will be much, much smaller
>(50%
>> > > > more or less). I think this is the most viable option.
>> > > >
>> > > > Any other ideas?
>> > > > I described all details in this issue:
>> > > > https://issues.apache.org/jira/browse/AIRFLOW-6947
>> > > >
>> > > > Here is similar story of somone who had similar issue
>> > > > https://serversforhackers.com/c/mysql-utf8-and-indexing
>> > > >
>> > > > J.
>> > > >
>> > > > Jarek Potiuk
>> > > > Polidea | Principal Software Engineer
>> > > >
>> > > > M: +48 660 796 129