Posted to users@nifi.apache.org by "wanglei2@geekplus.com.cn" <wa...@geekplus.com.cn> on 2019/10/15 00:08:52 UTC

Data inconsistency happens when using CDC to replicate my database

I am using CaptureChangeMySQL to extract the binlog, do some translation, and then write to another database with the PutDatabaseRecord processor.
But there is always data inconsistency between the destination database and the source database. To debug this, I applied the following settings:
CaptureChangeMySQL outputs only one table. There is a field called order_no that is unique in the table.
All the processors are scheduled with only one concurrent task.
No load balancing between nodes. Everything runs on the primary node.
After CaptureChangeMySQL, I added a LogAttribute processor called log1. Before PutDatabaseRecord, I added another LogAttribute processor called log2.
For the inconsistent data, I can grep the order_no in log1 and log2.
For one specific order_no, there are 5 binlog messages in total. But in log1 there is only one message. In log2 there are 5, but the order has changed.

position   type
201721167  insert (appeared in log1 and log2)
201926490  update (appeared only in log2)
202728760  update (appeared only in log2)
203162806  update (appeared only in log2)
203135127  update (appeared only in log2; the position number is smaller than that of the previous message)
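
A minimal sketch of how the out-of-order entry in the listing above could be detected automatically, assuming the positions have already been extracted from log2 in arrival order and all come from the same binlog file. The function name find_out_of_order is hypothetical and is not part of NiFi.

```python
from typing import List, Tuple


def find_out_of_order(positions: List[int]) -> List[Tuple[int, int, int]]:
    """Return (index, previous, current) for each position smaller than its predecessor."""
    inversions = []
    for i in range(1, len(positions)):
        if positions[i] < positions[i - 1]:
            inversions.append((i, positions[i - 1], positions[i]))
    return inversions


# Binlog positions in the order they appeared in log2 (taken from the listing above).
log2_positions = [201721167, 201926490, 202728760, 203162806, 203135127]

print(find_out_of_order(log2_positions))
# prints [(4, 203162806, 203135127)]
```

Only the last message is flagged: it arrived after a message with a larger binlog position.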

This really confuses me.
Any insight on this?  Thanks very much.

Lei



wanglei2@geekplus.com.cn

Re: Data inconsistency happens when using CDC to replicate my database

Posted by Koji Kawamura <ij...@gmail.com>.
Hi Lei,

To address the FlowFile ordering issue related to CaptureChangeMySQL, I'd
recommend using the EnforceOrder processor and the FIFO prioritizer before a
processor that requires precise ordering. EnforceOrder can use the
"cdc.sequence.id" attribute.

Thanks,
Koji
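
The idea behind ordering by a sequence attribute can be illustrated with a small sketch. This is not NiFi's EnforceOrder implementation; it only assumes that every event carries a consecutive integer sequence number (as "cdc.sequence.id" is used above) and that the first expected number is known. The function enforce_order and the sample payloads are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def enforce_order(
    events: Iterable[Tuple[int, str]], first_seq: int = 0
) -> Iterator[Tuple[int, str]]:
    """Yield events in sequence order, holding back any that arrive early."""
    pending: Dict[int, str] = {}  # events that arrived before their turn
    next_seq = first_seq          # sequence number expected next
    for seq, payload in events:
        pending[seq] = payload
        # Release every event that is now contiguous with what was already emitted.
        while next_seq in pending:
            yield next_seq, pending.pop(next_seq)
            next_seq += 1
    # Events still in `pending` here are waiting for a missing sequence number;
    # this sketch simply drops them instead of applying a wait timeout.


# Arrival order with the last two events swapped, similar to the reported case.
arrived = [(0, "insert"), (1, "update-a"), (2, "update-b"), (4, "update-d"), (3, "update-c")]

for seq, payload in enforce_order(arrived):
    print(seq, payload)
```

Event 4 is held until event 3 arrives, so the downstream consumer sees sequence numbers 0 through 4 in order.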

On Tue, Oct 15, 2019 at 1:14 PM wanglei2@geekplus.com.cn
<wa...@geekplus.com.cn> wrote:
>
>
> It seems to be related to which prioritizer is used.
> The inconsistency occurs when the OldestFlowFileFirst prioritizer is used, but does not occur when the FirstInFirstOut prioritizer is used.
> But I have no idea why.
> Any insight on this?
>
> Thanks,
> Lei
>
>
> ________________________________
> wanglei2@geekplus.com.cn
>
>
> From: wanglei2@geekplus.com.cn
> Sent: 2019-10-15 08:08
> To: users
> Cc: dev
> Subject: Data inconsistency happens when using CDC to replicate my database
> I am using CaptureChangeMySQL to extract the binlog, do some translation, and then write to another database with the PutDatabaseRecord processor.
> But there is always data inconsistency between the destination database and the source database. To debug this, I applied the following settings:
>
> CaptureChangeMySQL outputs only one table. There is a field called order_no that is unique in the table.
> All the processors are scheduled with only one concurrent task.
> No load balancing between nodes. Everything runs on the primary node.
> After CaptureChangeMySQL, I added a LogAttribute processor called log1. Before PutDatabaseRecord, I added another LogAttribute processor called log2.
>
> For the inconsistent data, I can grep the order_no in log1 and log2.
> For one specific order_no, there are 5 binlog messages in total. But in log1 there is only one message. In log2 there are 5, but the order has changed.
>
> position   type
> 201721167  insert (appeared in log1 and log2)
> 201926490  update (appeared only in log2)
> 202728760  update (appeared only in log2)
> 203162806  update (appeared only in log2)
> 203135127  update (appeared only in log2; the position number is smaller than that of the previous message)
>
> This really confuses me.
> Any insight on this?  Thanks very much.
>
> Lei
>
> ________________________________
> wanglei2@geekplus.com.cn

Re: Data inconsistency happens when using CDC to replicate my database

Posted by "wanglei2@geekplus.com.cn" <wa...@geekplus.com.cn>.
It seems to be related to which prioritizer is used.
The inconsistency occurs when the OldestFlowFileFirst prioritizer is used, but does not occur when the FirstInFirstOut prioritizer is used.
But I have no idea why. 
Any insight on this?

Thanks,
Lei




wanglei2@geekplus.com.cn
 
From: wanglei2@geekplus.com.cn
Sent: 2019-10-15 08:08
To: users
Cc: dev
Subject: Data inconsistency happens when using CDC to replicate my database
I am using CaptureChangeMySQL to extract the binlog, do some translation, and then write to another database with the PutDatabaseRecord processor.
But there is always data inconsistency between the destination database and the source database. To debug this, I applied the following settings:
CaptureChangeMySQL outputs only one table. There is a field called order_no that is unique in the table.
All the processors are scheduled with only one concurrent task.
No load balancing between nodes. Everything runs on the primary node.
After CaptureChangeMySQL, I added a LogAttribute processor called log1. Before PutDatabaseRecord, I added another LogAttribute processor called log2.
For the inconsistent data, I can grep the order_no in log1 and log2.
For one specific order_no, there are 5 binlog messages in total. But in log1 there is only one message. In log2 there are 5, but the order has changed.

position   type
201721167  insert (appeared in log1 and log2)
201926490  update (appeared only in log2)
202728760  update (appeared only in log2)
203162806  update (appeared only in log2)
203135127  update (appeared only in log2; the position number is smaller than that of the previous message)

This really confuses me.
Any insight on this?  Thanks very much.

Lei



wanglei2@geekplus.com.cn
