You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by tobe <to...@gmail.com> on 2014/08/12 05:55:54 UTC

How to verify data from MySQL and HBase

Most of our users migrated their data form MySQL to HBase. Before they
totally trust HBase, they use MySQL and HBase at the same time. Sometimes
the data is inconsistent because they use it incorrectly or maybe there're
bugs of HBase. Anyway, we have to make sure the data from MySQL and HBase
is consistent.

So how can we do that? Write a simple script or is there any general method?

Re: How to verify data from MySQL and HBase

Posted by tobe <to...@gmail.com>.
Thank @lars for the suggestions.

We will randomly pick some columns from MySQL, then compare with their
values in HBase. Because the data is constantly increasing, we will not
verify all the data.


On Wed, Aug 13, 2014 at 1:14 PM, lars hofhansl <la...@apache.org> wrote:

> Just in the interest of stating the obvious: Don't write a tool that scans
> through all the data in MySQL (or HBase) and then looks up each individual
> row (or even batches of rows) in the other store. That is very inefficient
> if you have a lot of data.
>
> Do it like a merge-join instead: Get sorted results from MySQL such that
> they sort in the same order as the HBase key. Then read through those
> results and the HBase scan at the same time, advancing both sides together.
> That way you need only a single scan on both sides (per table).
>
> -- Lars
>
>
>
> ________________________________
>  From: tobe <to...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Tuesday, August 12, 2014 7:01 AM
> Subject: Re: How to verify data from MySQL and HBase
>
>
> Thank @JM for the detailed explanation.
>
> I totally agree with you and we're developing an internal tool to do it.
> It's not so general because we have to write the sql and generate the row
> key manually. But it works and easy to understand. I would like to share
> with anybody who also needs it.
>
>
>
>
>
>
> On Tue, Aug 12, 2014 at 8:45 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > Hi Tobe,
> >
> > Thing is, your data in HBase might be organize very differently than in
> > MySQL. Have you denormlized some of it? Have you used some Avro
> containers
> > into an HBase cell? Have you do any cleanup? Or enrichment? At the end,
> it
> > my be very different that what is stored into MySQL. There is not any
> easy
> > process to validate that migration as been done correctly between the two
> > databases because tools don't know the transformations you applied. I
> think
> > you will have to build someone internally which will do this validation
> > because not such tool exist today.
> >
> > You can use the row counters to count the rows, but will that mean
> content
> > is correct? No. Calculate a CRC on the cells value? Not even, because you
> > might have denormalized.
> >
> > Sorry, but you will have to do some coding here I think.
> >
> > JM
> >
> > JM
> >
> >
> > 2014-08-12 3:49 GMT-04:00 tobe <to...@gmail.com>:
> >
> > > Thanks for replaying. @Serega
> > >
> > > I know MySQL and HBase are reliable. What I want to validate is the
> date
> > in
> > > both MySQL and HBase. The upper-stream application and unexpected
> > operation
> > > may make it inconsistent.
> > >
> > > I'm also wondering could sqoop validate the values from each database?
> > > There's RowCountValidator but it's not suitable for us.
> > >
> > >
> > > On Tue, Aug 12, 2014 at 3:23 PM, Serega Sheypak <
> > serega.sheypak@gmail.com>
> > > wrote:
> > >
> > > > you should design resilient ETL-processes. Also introduce post-ETL
> > > checks.
> > > > There is no need to test MySQL or HBase. They are already tested.
> > > > See this:
> > > >
> > >
> >
> http://www.slideshare.net/wyaddow/data-verification-in-qa-department-final
> > > > Pretty old, but gives basic ideas. Nothing changed from that time.
> > > >
> > > >
> > > > 2014-08-12 7:55 GMT+04:00 tobe <to...@gmail.com>:
> > > >
> > > > > Most of our users migrated their data form MySQL to HBase. Before
> > they
> > > > > totally trust HBase, they use MySQL and HBase at the same time.
> > > Sometimes
> > > > > the data is inconsistent because they use it incorrectly or maybe
> > > > there're
> > > > > bugs of HBase. Anyway, we have to make sure the data from MySQL and
> > > HBase
> > > > > is consistent.
> > > > >
> > > > > So how can we do that? Write a simple script or is there any
> general
> > > > > method?
> > > > >
> > > >
> > >
> >
>

Re: How to verify data from MySQL and HBase

Posted by lars hofhansl <la...@apache.org>.
Just in the interest of stating the obvious: Don't write a tool that scans through all the data in MySQL (or HBase) and then looks up each individual row (or even batches of rows) in the other store. That is very inefficient if you have a lot of data.

Do it like a merge-join instead: Get sorted results from MySQL such that they sort in the same order as the HBase key. Then read through those results and the HBase scan at the same time, advancing both sides together. That way you need only a single scan on both sides (per table).

-- Lars



________________________________
 From: tobe <to...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Tuesday, August 12, 2014 7:01 AM
Subject: Re: How to verify data from MySQL and HBase
 

Thank @JM for the detailed explanation.

I totally agree with you and we're developing an internal tool to do it.
It's not so general because we have to write the sql and generate the row
key manually. But it works and easy to understand. I would like to share
with anybody who also needs it.






On Tue, Aug 12, 2014 at 8:45 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Tobe,
>
> Thing is, your data in HBase might be organize very differently than in
> MySQL. Have you denormlized some of it? Have you used some Avro containers
> into an HBase cell? Have you do any cleanup? Or enrichment? At the end, it
> my be very different that what is stored into MySQL. There is not any easy
> process to validate that migration as been done correctly between the two
> databases because tools don't know the transformations you applied. I think
> you will have to build someone internally which will do this validation
> because not such tool exist today.
>
> You can use the row counters to count the rows, but will that mean content
> is correct? No. Calculate a CRC on the cells value? Not even, because you
> might have denormalized.
>
> Sorry, but you will have to do some coding here I think.
>
> JM
>
> JM
>
>
> 2014-08-12 3:49 GMT-04:00 tobe <to...@gmail.com>:
>
> > Thanks for replaying. @Serega
> >
> > I know MySQL and HBase are reliable. What I want to validate is the date
> in
> > both MySQL and HBase. The upper-stream application and unexpected
> operation
> > may make it inconsistent.
> >
> > I'm also wondering could sqoop validate the values from each database?
> > There's RowCountValidator but it's not suitable for us.
> >
> >
> > On Tue, Aug 12, 2014 at 3:23 PM, Serega Sheypak <
> serega.sheypak@gmail.com>
> > wrote:
> >
> > > you should design resilient ETL-processes. Also introduce post-ETL
> > checks.
> > > There is no need to test MySQL or HBase. They are already tested.
> > > See this:
> > >
> >
> http://www.slideshare.net/wyaddow/data-verification-in-qa-department-final
> > > Pretty old, but gives basic ideas. Nothing changed from that time.
> > >
> > >
> > > 2014-08-12 7:55 GMT+04:00 tobe <to...@gmail.com>:
> > >
> > > > Most of our users migrated their data form MySQL to HBase. Before
> they
> > > > totally trust HBase, they use MySQL and HBase at the same time.
> > Sometimes
> > > > the data is inconsistent because they use it incorrectly or maybe
> > > there're
> > > > bugs of HBase. Anyway, we have to make sure the data from MySQL and
> > HBase
> > > > is consistent.
> > > >
> > > > So how can we do that? Write a simple script or is there any general
> > > > method?
> > > >
> > >
> >
>

Re: How to verify data from MySQL and HBase

Posted by tobe <to...@gmail.com>.
Thank @JM for the detailed explanation.

I totally agree with you and we're developing an internal tool to do it.
It's not so general because we have to write the sql and generate the row
key manually. But it works and easy to understand. I would like to share
with anybody who also needs it.



On Tue, Aug 12, 2014 at 8:45 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Tobe,
>
> Thing is, your data in HBase might be organize very differently than in
> MySQL. Have you denormlized some of it? Have you used some Avro containers
> into an HBase cell? Have you do any cleanup? Or enrichment? At the end, it
> my be very different that what is stored into MySQL. There is not any easy
> process to validate that migration as been done correctly between the two
> databases because tools don't know the transformations you applied. I think
> you will have to build someone internally which will do this validation
> because not such tool exist today.
>
> You can use the row counters to count the rows, but will that mean content
> is correct? No. Calculate a CRC on the cells value? Not even, because you
> might have denormalized.
>
> Sorry, but you will have to do some coding here I think.
>
> JM
>
> JM
>
>
> 2014-08-12 3:49 GMT-04:00 tobe <to...@gmail.com>:
>
> > Thanks for replaying. @Serega
> >
> > I know MySQL and HBase are reliable. What I want to validate is the date
> in
> > both MySQL and HBase. The upper-stream application and unexpected
> operation
> > may make it inconsistent.
> >
> > I'm also wondering could sqoop validate the values from each database?
> > There's RowCountValidator but it's not suitable for us.
> >
> >
> > On Tue, Aug 12, 2014 at 3:23 PM, Serega Sheypak <
> serega.sheypak@gmail.com>
> > wrote:
> >
> > > you should design resilient ETL-processes. Also introduce post-ETL
> > checks.
> > > There is no need to test MySQL or HBase. They are already tested.
> > > See this:
> > >
> >
> http://www.slideshare.net/wyaddow/data-verification-in-qa-department-final
> > > Pretty old, but gives basic ideas. Nothing changed from that time.
> > >
> > >
> > > 2014-08-12 7:55 GMT+04:00 tobe <to...@gmail.com>:
> > >
> > > > Most of our users migrated their data form MySQL to HBase. Before
> they
> > > > totally trust HBase, they use MySQL and HBase at the same time.
> > Sometimes
> > > > the data is inconsistent because they use it incorrectly or maybe
> > > there're
> > > > bugs of HBase. Anyway, we have to make sure the data from MySQL and
> > HBase
> > > > is consistent.
> > > >
> > > > So how can we do that? Write a simple script or is there any general
> > > > method?
> > > >
> > >
> >
>

Re: How to verify data from MySQL and HBase

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Tobe,

Thing is, your data in HBase might be organize very differently than in
MySQL. Have you denormlized some of it? Have you used some Avro containers
into an HBase cell? Have you do any cleanup? Or enrichment? At the end, it
my be very different that what is stored into MySQL. There is not any easy
process to validate that migration as been done correctly between the two
databases because tools don't know the transformations you applied. I think
you will have to build someone internally which will do this validation
because not such tool exist today.

You can use the row counters to count the rows, but will that mean content
is correct? No. Calculate a CRC on the cells value? Not even, because you
might have denormalized.

Sorry, but you will have to do some coding here I think.

JM

JM


2014-08-12 3:49 GMT-04:00 tobe <to...@gmail.com>:

> Thanks for replaying. @Serega
>
> I know MySQL and HBase are reliable. What I want to validate is the date in
> both MySQL and HBase. The upper-stream application and unexpected operation
> may make it inconsistent.
>
> I'm also wondering could sqoop validate the values from each database?
> There's RowCountValidator but it's not suitable for us.
>
>
> On Tue, Aug 12, 2014 at 3:23 PM, Serega Sheypak <se...@gmail.com>
> wrote:
>
> > you should design resilient ETL-processes. Also introduce post-ETL
> checks.
> > There is no need to test MySQL or HBase. They are already tested.
> > See this:
> >
> http://www.slideshare.net/wyaddow/data-verification-in-qa-department-final
> > Pretty old, but gives basic ideas. Nothing changed from that time.
> >
> >
> > 2014-08-12 7:55 GMT+04:00 tobe <to...@gmail.com>:
> >
> > > Most of our users migrated their data form MySQL to HBase. Before they
> > > totally trust HBase, they use MySQL and HBase at the same time.
> Sometimes
> > > the data is inconsistent because they use it incorrectly or maybe
> > there're
> > > bugs of HBase. Anyway, we have to make sure the data from MySQL and
> HBase
> > > is consistent.
> > >
> > > So how can we do that? Write a simple script or is there any general
> > > method?
> > >
> >
>

Re: How to verify data from MySQL and HBase

Posted by tobe <to...@gmail.com>.
Thanks for replaying. @Serega

I know MySQL and HBase are reliable. What I want to validate is the date in
both MySQL and HBase. The upper-stream application and unexpected operation
may make it inconsistent.

I'm also wondering could sqoop validate the values from each database?
There's RowCountValidator but it's not suitable for us.


On Tue, Aug 12, 2014 at 3:23 PM, Serega Sheypak <se...@gmail.com>
wrote:

> you should design resilient ETL-processes. Also introduce post-ETL checks.
> There is no need to test MySQL or HBase. They are already tested.
> See this:
> http://www.slideshare.net/wyaddow/data-verification-in-qa-department-final
> Pretty old, but gives basic ideas. Nothing changed from that time.
>
>
> 2014-08-12 7:55 GMT+04:00 tobe <to...@gmail.com>:
>
> > Most of our users migrated their data form MySQL to HBase. Before they
> > totally trust HBase, they use MySQL and HBase at the same time. Sometimes
> > the data is inconsistent because they use it incorrectly or maybe
> there're
> > bugs of HBase. Anyway, we have to make sure the data from MySQL and HBase
> > is consistent.
> >
> > So how can we do that? Write a simple script or is there any general
> > method?
> >
>

Re: How to verify data from MySQL and HBase

Posted by Serega Sheypak <se...@gmail.com>.
you should design resilient ETL-processes. Also introduce post-ETL checks.
There is no need to test MySQL or HBase. They are already tested.
See this:
http://www.slideshare.net/wyaddow/data-verification-in-qa-department-final
Pretty old, but gives basic ideas. Nothing changed from that time.


2014-08-12 7:55 GMT+04:00 tobe <to...@gmail.com>:

> Most of our users migrated their data form MySQL to HBase. Before they
> totally trust HBase, they use MySQL and HBase at the same time. Sometimes
> the data is inconsistent because they use it incorrectly or maybe there're
> bugs of HBase. Anyway, we have to make sure the data from MySQL and HBase
> is consistent.
>
> So how can we do that? Write a simple script or is there any general
> method?
>