Posted to user@flume.apache.org by Felix Giguere Villegas <fe...@mate1inc.com> on 2011/08/09 20:42:37 UTC

Best way to extract mysql bin logs?

Hi :) !

I have a use case where I want to keep a historical record of all the
changes (insert/update/delete) happening on a MySQL DB.

I am able to tail the bin logs and record them in HDFS, but they are not
easy to parse because one operation is split across many lines. There are
also some comments that include the timestamp, the total time it took to
execute the query, and other things. A lot of this extra info is not
relevant, but the timestamp is important for me. I figured I might as well
keep the rest of the info too, since having the raw data gives me the option
of going back to look for those other fields if I later decide that I need
them.

Now, the fact that each operation is split across many lines makes the data
harder to use with Map/Reduce.

I have thought of using a custom M/R RecordReader, but I still have the
problem that some of the lines related to one operation will end up at the
end of one HDFS file and the rest at the beginning of the next, since I am
opening and closing those files at an arbitrary roll time.

I think the easiest way would be to do some minimal ETL at the source, and I
think I could use a custom decorator for this. Basically, that decorator
would group together on a single line all the bin log lines that relate to a
single DB operation, with the original lines separated by semicolons or some
other character in the final output.
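
To make that concrete, here is a rough standalone sketch of the grouping I
have in mind (plain Java reading stdin, not Flume's actual decorator API).
It assumes that each event in the mysqlbinlog output starts with a
"# at <offset>" line:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hypothetical sketch: join every line belonging to one binlog event onto a
// single semicolon-separated line, using the "# at <offset>" marker as the
// event boundary.
public class BinlogGrouper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        StringBuilder current = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            // A new "# at" line starts a new event: flush the previous one.
            if (line.startsWith("# at ") && current.length() > 0) {
                System.out.println(current);
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(';');
            }
            current.append(line);
        }
        if (current.length() > 0) {
            System.out.println(current);   // flush the last buffered event
        }
    }
}

You would run it as something like
"mysqlbinlog mysql-bin.000001 | java BinlogGrouper" (the binlog file name is
just an example). A decorator would do the same buffering on Flume events
instead of stdin/stdout, but the logic would be the same.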

I wanted to check with you guys to see if that approach makes sense. If you
have better suggestions, then I'm all ears, of course. Also, if you think
there is an easier way than reading the bin logs to accomplish my original
goal, I'd like to hear about it as well :)

Thanks :) !

--
Felix

Re: Best way to extract mysql bin logs?

Posted by Felix Giguere Villegas <fe...@mate1inc.com>.
Yes, I'm planning to do that once we get something a little more concrete
working end to end.

I'll let you guys know for sure :)

--
Felix

Re: Best way to extract mysql bin logs?

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Good to hear. I think it would be interesting if you wrote up a little blog
post about your use case!

Jon.

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Best way to extract mysql bin logs?

Posted by Felix Giguere Villegas <fe...@mate1inc.com>.
Quick update.

Sometimes the simplest solutions are the best, hehe... We decided to just use
a Perl one-liner to reshape the data as it is output by mysqlbinlog:

perl -e 'while (<STDIN>) {s/\r|\n|\r\n/ /g;s/# at/\n# at/g;print;}'

We write this modified binlog to a file and tail it with Flume. This way,
MySQL events are grouped one per line, and even if they reach their
destination in the wrong order, it's easy to sort them back into the correct
order downstream :)
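
For the downstream sorting, something like the following would work (a
hypothetical Java sketch; the sample lines are made up), since after the
transformation every grouped line still begins with its "# at <offset>"
marker. Offsets restart in each new binlog file, so this only orders lines
coming from the same file:

import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: restore the original order of grouped binlog lines by
// parsing the byte offset out of the leading "# at <offset>" marker.
public class BinlogLineSorter {
    static long offsetOf(String line) {
        String rest = line.substring("# at ".length());
        int end = 0;
        while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
            end++;
        }
        return Long.parseLong(rest.substring(0, end));
    }

    public static void main(String[] args) {
        // Made-up examples of grouped lines arriving out of order.
        String[] lines = {
            "# at 219 ...some statement...",
            "# at 4 ...format description event...",
            "# at 107 ...another statement..."
        };
        Arrays.sort(lines, Comparator.comparingLong(BinlogLineSorter::offsetOf));
        for (String l : lines) {
            System.out.println(l);
        }
    }
}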

Eventually, we will probably wrap it all up in a script and use the execStream
Flume source. That way we won't need to write an intermediate/temporary file
to disk, which will leave more disk I/O throughput available for DFO or E2E
reliability...

--
Felix

Re: Best way to extract mysql bin logs?

Posted by Felix Giguere Villegas <fe...@mate1inc.com>.
Ah, so a source would make more sense than a decorator, then. I see...

We are definitely open to the idea of contributing back, and this is indeed
probably something that a lot of people could use...

We are still evaluating what we will do, as we have a lot of stuff going on
at once, but if we do decide to develop a custom source, then I'll do as you
suggest and open a JIRA issue.

Thanks for your time and info :)

--
Felix

Re: Best way to extract mysql bin logs?

Posted by Eric Sammer <es...@cloudera.com>.
Felix:

You definitely need to implement a custom source that knows how to
read the bin logs and pack each transaction into an event, rather than
just tailing them. This will give you a discrete event for each
transaction that can be treated as a single unit, and it will make
downstream processing MUCH easier.

Things to keep in mind:
* Flume does NOT guarantee order, so make sure each event has a
timestamp or transaction ID that you can order by.
* Flume does NOT guarantee that you won't get duplicates, so make sure
you have a globally unique transaction ID that you can use to deduplicate
transactions (see the sketch just below).
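
As a rough illustration of those two points (the BinlogEvent shape and field
names below are made up, just standing in for whatever you end up
extracting), the downstream cleanup could be as simple as:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: re-order events by timestamp and drop duplicates by
// transaction ID after they have landed, since Flume guarantees neither.
public class DownstreamCleanup {
    static class BinlogEvent {
        final long timestamp;   // taken from the binlog comments
        final String txnId;     // must be globally unique for dedup to work
        final String body;
        BinlogEvent(long timestamp, String txnId, String body) {
            this.timestamp = timestamp;
            this.txnId = txnId;
            this.body = body;
        }
    }

    static List<BinlogEvent> cleanup(List<BinlogEvent> events) {
        // Flume may deliver out of order: restore order using the timestamp.
        List<BinlogEvent> sorted = new ArrayList<BinlogEvent>(events);
        Collections.sort(sorted, new Comparator<BinlogEvent>() {
            public int compare(BinlogEvent a, BinlogEvent b) {
                return Long.compare(a.timestamp, b.timestamp);
            }
        });
        // Flume may deliver duplicates: keep the first event per transaction ID.
        Set<String> seen = new HashSet<String>();
        List<BinlogEvent> unique = new ArrayList<BinlogEvent>();
        for (BinlogEvent e : sorted) {
            if (seen.add(e.txnId)) {
                unique.add(e);
            }
        }
        return unique;
    }
}

In a real pipeline you would run something like this as a Map/Reduce or
post-processing step over whatever lands in HDFS.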

This would be interesting functionality to get back into Flume. If you
can or want to contribute it in the form of a custom source, feel free
to open a JIRA so others can help out and watch progress.

Thanks!

-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com