Posted to users@kafka.apache.org by Chris Riccomini <cr...@apache.org> on 2016/02/24 01:26:27 UTC

Kafka connect JDBC DB SCNs

Hey all,

I was reviewing the Kafka connect JDBC driver, and I had a question. Is it
possible to use the JDBC driver with a look-back configured? The reason
that I ask is that there are some known issues with using a modified
timestamp:

Slide 14 here explains one with Oracle:

https://qconsf.com/sf2007/dl/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf?path=/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf

There is also some SCN-related discussion here:

https://github.com/linkedin/databus/wiki/Databus-for-MySQL

Though that is more specific to MySQL.

I am concerned that using insert IDs might not even be good enough
(assuming my tables were immutable, which they're not), since I believe
some DB storage systems might have the same issue. I think InnoDB's pkey
auto increment ID commit order is even tunable based on config.

I would rather get some duplicates than lose data if at all possible. Can I
configure the JDBC driver to subtract some number from the offset to
prevent (or drastically reduce) lost data?

Cheers,
Chris

Re: Kafka connect JDBC DB SCNs

Posted by Chris Riccomini <cr...@apache.org>.
Given how messed up MySQL is, in particular, it seems to me that adding a
config that can do a look-back (either ID- or timestamp-based) would be
useful, and pretty straightforward to add. This will at least help in
preventing data loss, at the expense of duplicate rows. If people *really*
care about not losing data, they can configure a second connector to do
bulk loads on a more periodic basis. Thoughts?
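
To make the proposal concrete, here is a rough sketch of what the
connector's incremental query might look like with a look-back applied
(table and column names are hypothetical, and `:last_offset_*` stands in
for whatever offset the connector has stored):

-- Hypothetical sketch: rewind the stored offset by a fixed window so that
-- rows committed late (but with earlier timestamps/IDs) are still picked up.

-- Timestamp-based look-back, 1-minute window:
SELECT * FROM my_table
WHERE modified_at > (:last_offset_ts - INTERVAL 1 MINUTE)
ORDER BY modified_at;

-- ID-based look-back, 100-ID window:
SELECT * FROM my_table
WHERE id > (:last_offset_id - 100)
ORDER BY id;

Consumers of the topic would need to tolerate the resulting duplicates,
or rely on log compaction / downstream dedup.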

On Wed, Feb 24, 2016 at 12:41 AM, Chris Riccomini <cr...@apache.org>
wrote:

> Hey all,
>
> Some more details on my question. My concern about even using IDs on
> immutable tables as a way to replicate data seems to be confirmed, at least
> for InnoDB if not configured properly. This page:
>
> http://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html
>
> describes how auto-increment IDs can get out of order. I have replicated
> this issue on my local machine (see example below the fold) with all three
> innodb_autoinc_lock_mode settings (0, 1, and 2). They all show the same results.
>
> As far as I can tell, at least with MySQL+InnoDB, it seems to me that the
> JDBC connector has no way to configure it such that it isn't at risk of
> dropping data on incremental DB polls (whether you use incrementing,
> timestamp, or timestamp+incrementing).
>
> I'm curious if anyone has any thoughts on this. Again, the link I showed
> in the above email suggests that this is a problem with Oracle as well. Are
> people seeing data loss with the JDBC driver on high-write-load tables? I
> would be surprised if they weren't.
>
> Several solutions covered in that PDF are of interest:
>
> 1. The look-back based approach (NOW() - 1 minute, or MAX(id) - 100). This
> will lead to significant duplication in the Kafka topic. Even with log
> compaction enabled, this will be annoying to users that are reading the
> feed in realtime.
> 2. Some kind of pre-query assignment. Essentially a field that's set to
> NULL on insert. Every update sets the field back to NULL. An async process
> periodically wakes up and sets the field to a monotonically increasing ID.
> In the PDF, they suggest SCN for Oracle. MySQL doesn't easily expose this,
> but I think UNIX_TIMESTAMP() would suffice as long as the updates don't
> happen too frequently.
>
> For (2), one idea would be to add the ability in the JDBC connector to add
> a pre-query, which could be executed just before the SELECT. In this mode,
> you could execute an UPDATE my_table SET scn = UNIX_TIMESTAMP() WHERE scn IS
> NULL. As far as I can tell, with MySQL, this would involve using a trigger
> to set the `scn` field back to NULL whenever an update on the row occurs.
>
> Cheers,
> Chris
>
> ----------
>
> #### Start with an empty table
>
> mysql> select * from funding_instructions;
>
> Empty set (0.00 sec)
>
> #### Create a transaction, and insert a row with state=2, but don't commit
> (this is done in another terminal).
>
> mysql> begin;
>
> Query OK, 0 rows affected (0.00 sec)
>
> mysql> insert into funding_instructions (state) VALUES (2);
>
> Query OK, 1 row affected, 4 warnings (0.00 sec)
>
> #### Outside of the transaction in the main terminal, insert a second row
> with state=3
>
> mysql> insert into funding_instructions (state) VALUES (3);
>
> Query OK, 1 row affected, 4 warnings (0.00 sec)
>
> #### Now you see only the second row that was inserted. Note the ID.
>
> mysql> select * from funding_instructions;
>
>
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | 25607888 |     3 |    0 |      0 | XXX      | NULL  | NULL     |           0 |           0 |       0 |          0 | NULL | NULL                        |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
>
> 1 row in set (0.00 sec)
>
> #### Now commit the first transaction.
>
> mysql> commit;
>
> Query OK, 0 rows affected (0.00 sec)
>
> #### Now look at your results. Note that the smaller ID now appears. This
> breaks strict ID-based replication, even if the table is append-only.
>
> mysql> select * from funding_instructions;
>
>
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | 25607887 |     2 |    0 |      0 | XXX      | NULL  | NULL     |           0 |           0 |       0 |          0 | NULL | NULL                        |
> | 25607888 |     3 |    0 |      0 | XXX      | NULL  | NULL     |           0 |           0 |       0 |          0 | NULL | NULL                        |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
>
> On Tue, Feb 23, 2016 at 4:26 PM, Chris Riccomini <cr...@apache.org>
> wrote:
>
>> Hey all,
>>
>> I was reviewing the Kafka connect JDBC driver, and I had a question. Is
>> it possible to use the JDBC driver with a look-back configured? The reason
>> that I ask is that there are some known issues with using a modified
>> timestamp:
>>
>> Slide 14 here explains one with Oracle:
>>
>>
>> https://qconsf.com/sf2007/dl/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf?path=/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf
>>
>> There is also some SCN-related discussion here:
>>
>> https://github.com/linkedin/databus/wiki/Databus-for-MySQL
>>
>> Though that is more specific to MySQL.
>>
>> I am concerned that using insert IDs might not even be good enough
>> (assuming my tables were immutable, which they're not), since I believe
>> some DB storage systems might have the same issue. I think InnoDB's pkey
>> auto increment ID commit order is even tunable based on config.
>>
>> I would rather get some duplicates than lose data if at all possible. Can
>> I configure the JDBC driver to subtract some number from the offset to
>> prevent (or drastically reduce) lost data?
>>
>> Cheers,
>> Chris
>>
>
>

Re: Kafka connect JDBC DB SCNs

Posted by Chris Riccomini <cr...@apache.org>.
Hey all,

Some more details on my question. My concern about even using IDs on
immutable tables as a way to replicate data seems to be confirmed, at least
for InnoDB if not configured properly. This page:

http://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html

describes how auto-increment IDs can get out of order. I have replicated
this issue on my local machine (see example below the fold) with all three
innodb_autoinc_lock_mode settings (0, 1, and 2). They all show the same results.
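
For anyone reproducing this, the active lock mode can be checked from any
session (it is a server startup option, so the value is whatever the
server was started with):

mysql> SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';

The documented values are 0 ("traditional"), 1 ("consecutive", the 5.7
default), and 2 ("interleaved").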

As far as I can tell, at least with MySQL+InnoDB, it seems to me that the
JDBC connector has no way to configure it such that it isn't at risk of
dropping data on incremental DB polls (whether you use incrementing,
timestamp, or timestamp+incrementing).

I'm curious if anyone has any thoughts on this. Again, the link I showed in
the above email suggests that this is a problem with Oracle as well. Are
people seeing data loss with the JDBC driver on high-write-load tables? I
would be surprised if they weren't.

Several solutions covered in that PDF are of interest:

1. The look-back based approach (NOW() - 1 minute, or MAX(id) - 100). This
will lead to significant duplication in the Kafka topic. Even with log
compaction enabled, this will be annoying to users that are reading the
feed in realtime.
2. Some kind of pre-query assignment. Essentially a field that's set to
NULL on insert. Every update sets the field back to NULL. An async process
periodically wakes up and sets the field to a monotonically increasing ID.
In the PDF, they suggest SCN for Oracle. MySQL doesn't easily expose this,
but I think UNIX_TIMESTAMP() would suffice as long as the updates don't
happen too frequently.

For (2), one idea would be to add the ability in the JDBC connector to add
a pre-query, which could be executed just before the SELECT. In this mode,
you could execute an UPDATE my_table SET scn = UNIX_TIMESTAMP() WHERE scn IS
NULL. As far as I can tell, with MySQL, this would involve using a trigger
to set the `scn` field back to NULL whenever an update on the row occurs.
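
A rough MySQL sketch of idea (2), with hypothetical table and column names
(and the caveat that UNIX_TIMESTAMP() only has one-second granularity, so
rows stamped within the same second as a poll could still be missed
without a small look-back):

-- One-time setup: add the stamp column and a trigger that clears it on
-- every update, so updated rows get re-stamped.
ALTER TABLE my_table ADD COLUMN scn BIGINT NULL;

CREATE TRIGGER my_table_clear_scn
BEFORE UPDATE ON my_table
FOR EACH ROW
SET NEW.scn = NULL;

-- Pre-query, run just before each connector poll: stamp all unstamped rows.
UPDATE my_table SET scn = UNIX_TIMESTAMP() WHERE scn IS NULL;

-- The connector's incremental SELECT then keys off scn instead of the
-- raw auto-increment ID or modified timestamp:
SELECT * FROM my_table WHERE scn > :last_offset ORDER BY scn;

Because the stamp is assigned after commit (by the pre-query), it is
monotonic in commit order rather than insert order, which is the property
the auto-increment ID lacks.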

Cheers,
Chris

----------

#### Start with an empty table

mysql> select * from funding_instructions;

Empty set (0.00 sec)

#### Create a transaction, and insert a row with state=2, but don't commit
(this is done in another terminal).

mysql> begin;

Query OK, 0 rows affected (0.00 sec)

mysql> insert into funding_instructions (state) VALUES (2);

Query OK, 1 row affected, 4 warnings (0.00 sec)

#### Outside of the transaction in the main terminal, insert a second row
with state=3

mysql> insert into funding_instructions (state) VALUES (3);

Query OK, 1 row affected, 4 warnings (0.00 sec)

#### Now you see only the second row that was inserted. Note the ID.

mysql> select * from funding_instructions;

+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
| id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
| 25607888 |     3 |    0 |      0 | XXX      | NULL  | NULL     |           0 |           0 |       0 |          0 | NULL | NULL                        |
+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+

1 row in set (0.00 sec)

#### Now commit the first transaction.

mysql> commit;

Query OK, 0 rows affected (0.00 sec)

#### Now look at your results. Note that the smaller ID now appears. This
breaks strict ID-based replication, even if the table is append-only.

mysql> select * from funding_instructions;

+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
| id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
| 25607887 |     2 |    0 |      0 | XXX      | NULL  | NULL     |           0 |           0 |       0 |          0 | NULL | NULL                        |
| 25607888 |     3 |    0 |      0 | XXX      | NULL  | NULL     |           0 |           0 |       0 |          0 | NULL | NULL                        |
+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+

On Tue, Feb 23, 2016 at 4:26 PM, Chris Riccomini <cr...@apache.org>
wrote:

> Hey all,
>
> I was reviewing the Kafka connect JDBC driver, and I had a question. Is it
> possible to use the JDBC driver with a look-back configured? The reason
> that I ask is that there are some known issues with using a modified
> timestamp:
>
> Slide 14 here explains one with Oracle:
>
>
> https://qconsf.com/sf2007/dl/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf?path=/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf
>
> There is also some SCN-related discussion here:
>
> https://github.com/linkedin/databus/wiki/Databus-for-MySQL
>
> Though that is more specific to MySQL.
>
> I am concerned that using insert IDs might not even be good enough
> (assuming my tables were immutable, which they're not), since I believe
> some DB storage systems might have the same issue. I think InnoDB's pkey
> auto increment ID commit order is even tunable based on config.
>
> I would rather get some duplicates than lose data if at all possible. Can
> I configure the JDBC driver to subtract some number from the offset to
> prevent (or drastically reduce) lost data?
>
> Cheers,
> Chris
>