Posted to user@hive.apache.org by Alexander Pivovarov <ap...@gmail.com> on 2015/04/27 22:27:06 UTC

How to compare data in two tables?

Hi Everyone

Let's say I have a Hive table in two datacenters. The table format can be
textfile or ORC.
There is a Sqoop job running every day which adds data to the table.

Each datacenter has its own instance of the Sqoop job.
Ideally, the data in these two tables should be the same.

"The same" means that the row count is the same and the tables contain the
same rows. However, row order can differ. The number of files and their
sizes can also differ.

Is there a way to scan a table and get some hashcode which can be used to
compare the tables?

Thank you
Alex

RE: How to compare data in two tables?

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.

OK, so we have an Oracle sequence as the PK. That sequence is a monotonically increasing number, so each record will have its own sequence value. If you compute sum(sequence_col) for each Hive table, the two sums should agree. Together with a matching row count, that is a good indication that no row is missing.

As for checking that the rows themselves are the same, the hashcode approach looks good as long as you take the hashcode of each row in sequence_col order (in theory the rows should be in that order). However, if strict ordering is not maintained in the Hive table, then the sum of the hashcodes over all rows, together with sum(sequence_col), should be good enough.
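
The order-independent check described here (row count, sum of sequence_col, and a sum of per-row hashcodes) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name table_fingerprint, the sample rows, and the use of CRC32 as the per-row hash are not from the thread, and in practice the same aggregates would be computed by a query on each Hive instance. Matching fingerprints are strong evidence that the tables agree, not proof, because different rows can in principle produce the same sums.

```python
import zlib


def table_fingerprint(rows, key_index=0):
    """Return (row_count, key_sum, hash_sum) for an iterable of row tuples.

    All three values are plain sums, so they do not depend on row order.
    key_index is the position of the numeric key (sequence_col) in each row.
    """
    row_count = 0
    key_sum = 0
    hash_sum = 0
    for row in rows:
        row_count += 1
        key_sum += row[key_index]
        # CRC32 is an assumed stand-in for "the hashcode of a row".
        # repr() of the tuple keeps cell boundaries and types distinct.
        hash_sum += zlib.crc32(repr(row).encode("utf-8"))
    return row_count, key_sum, hash_sum


# Hypothetical sample data: (sequence_col, name, amount).
dc1 = [(1, "alice", 10.5), (2, "bob", 7.0), (3, "carol", 3.25)]
dc2 = [(3, "carol", 3.25), (1, "alice", 10.5), (2, "bob", 7.0)]  # same rows, other order
dc3 = [(1, "alice", 10.5), (2, "bob", 7.5), (3, "carol", 3.25)]  # one cell changed

print(table_fingerprint(dc1) == table_fingerprint(dc2))  # True
print(table_fingerprint(dc1) == table_fingerprint(dc3))  # False
```

Note that dc1 and dc3 have the same row count and the same key sum; only the hash sum reveals the changed cell, which is why the key sum alone is not enough to show that the rows are identical.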

HTH

Mich Talebzadeh

http://talebzadehmich.wordpress.com

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

From: Alexander Pivovarov [mailto:apivovarov@gmail.com] 
Sent: 27 April 2015 22:05
To: user@hive.apache.org
Subject: RE: How to compare data in two tables?

Golden source is Oracle DB.

I have two cases:

1. Tables are completely overwritten every day.

2. Tables are incrementally loaded. The PK is an auto-incremented number in Oracle.

What do you think if I concatenate all cells of a row into a string, get an int hashcode from that string, and then sum the hashcodes to get a final number for the table?


RE: How to compare data in two tables?

Posted by Alexander Pivovarov <ap...@gmail.com>.

Golden source is Oracle DB.

I have two cases:

1. Tables are completely overwritten every day.

2. Tables are incrementally loaded. The PK is an auto-incremented number in
Oracle.

What do you think if I concatenate all cells of a row into a string, get an
int hashcode from that string, and then sum the hashcodes to get a final
number for the table?
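
A minimal Python sketch of this idea, with two caveats made explicit. The thread does not say which hash function is meant; the sketch assumes the formula used by Java's String.hashCode() (equivalent for characters in the Basic Multilingual Plane). The function names, the separator and the NULL marker are illustrative choices, not something from the thread. The separator matters because plain concatenation erases cell boundaries: the rows ("ab", "c") and ("a", "bc") would otherwise produce the same string.

```python
def java_string_hash(s):
    """32-bit signed hash using the formula of Java's String.hashCode()."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, as Java int overflow does
    return h - 0x100000000 if h >= 0x80000000 else h


def row_to_string(row, sep="\x01", null="\\N"):
    """Join the cells of a row, keeping cell boundaries and NULLs visible.

    sep and null are assumed markers; pick values that cannot occur in the data.
    """
    return sep.join(null if cell is None else str(cell) for cell in row)


def table_hash_sum(rows):
    """Sum of per-row hashcodes; a sum does not depend on row order."""
    return sum(java_string_hash(row_to_string(row)) for row in rows)


a = ("ab", "c")
b = ("a", "bc")

# Plain concatenation cannot tell these two rows apart.
print("".join(a) == "".join(b))  # True

# With a separator the strings, and here the hashcodes, differ.
print(java_string_hash(row_to_string(a)) == java_string_hash(row_to_string(b)))  # False

# Row order does not change the table-level sum.
print(table_hash_sum([a, b]) == table_hash_sum([b, a]))  # True
```

Because a 32-bit hashcode has a limited range, two different tables can still produce the same sum, so equal sums should be read as "very likely the same" and combined with a row count comparison.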



RE: How to compare data in two tables?

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.

Hi Alex,

Am I correct that the source data resides in a relational table (the golden source), and that this table already holds all the data that is sent to both instances of Hive? Is the data in Hive added incrementally each day, with an "operation timestamp" for each record? Also, do you have a unique identifier for each row in each table?

HTH

Mich Talebzadeh