Posted to issues@hbase.apache.org by "Divesh Jain (JIRA)" <ji...@apache.org> on 2018/07/25 16:47:00 UTC

[jira] [Comment Edited] (HBASE-11715) HBase should provide a tool to compare 2 remote tables.

    [ https://issues.apache.org/jira/browse/HBASE-11715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555943#comment-16555943 ] 

Divesh Jain edited comment on HBASE-11715 at 7/25/18 4:46 PM:
--------------------------------------------------------------

How about introducing a new tool that takes the following approach:

1) Takes an HBase snapshot of the table on both clusters.

2) Launches a MapReduce job on each cluster to calculate the total number of row keys, the min key, the max key and a hash sum (a minimal sketch of such a job follows this list).

3) Compares the above values between the two clusters to figure out whether the data is the same.

4) Deletes the HBase snapshots taken in step 1.
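
A minimal sketch of the digest job in step 2, written against the HBase MapReduce API. HashDigestMapper is an illustrative name, not existing HBase code; it assumes the job scans a snapshot (set up by the driver below) and that each map task emits one partial summary which is later combined:

{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

/** Illustrative mapper: accumulates row count, min/max key and a hash sum per map task. */
public class HashDigestMapper extends TableMapper<NullWritable, Text> {

  private long rowCount = 0;
  private long hashSum = 0;
  private byte[] minKey = null;
  private byte[] maxKey = null;

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context) {
    rowCount++;
    // Summing per-row hashes keeps the digest independent of scan order,
    // so partial sums from different splits/regions can simply be added together.
    hashSum += rowHash(value);
    byte[] row = value.getRow();
    if (minKey == null || Bytes.compareTo(row, minKey) < 0) {
      minKey = row.clone();
    }
    if (maxKey == null || Bytes.compareTo(row, maxKey) > 0) {
      maxKey = row.clone();
    }
  }

  /** Hash of all cells in one row (row key, family, qualifier, timestamp and value). */
  private long rowHash(Result result) {
    long h = 17;
    for (Cell cell : result.rawCells()) {
      h = 31 * h + Bytes.hashCode(CellUtil.cloneRow(cell));
      h = 31 * h + Bytes.hashCode(CellUtil.cloneFamily(cell));
      h = 31 * h + Bytes.hashCode(CellUtil.cloneQualifier(cell));
      h = 31 * h + Long.hashCode(cell.getTimestamp());
      h = 31 * h + Bytes.hashCode(CellUtil.cloneValue(cell));
    }
    return h;
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (rowCount == 0) {
      return; // this split contained no rows
    }
    // One summary line per map task; the driver adds the counts and hash sums
    // and takes the overall min/max to get the table-wide digest.
    context.write(NullWritable.get(), new Text(rowCount + "\t" + hashSum + "\t"
        + Bytes.toStringBinary(minKey) + "\t" + Bytes.toStringBinary(maxKey)));
  }
}
{code}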

Since only a few bytes' worth of data are transferred between the clusters, the tool will perform faster than VerifyReplication. We can also look at extending VerifyReplication to operate in a fast mode using the above logic.

* The tool could also be made configurable to accept start and end row keys, enable/disable raw scans, accept a start/end time range, etc.
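
For steps 1, 2 and 4, a hedged driver sketch (SnapshotDigestDriver and the paths are illustrative, and it assumes the HashDigestMapper sketch above): it snapshots the table, runs the digest job over the snapshot via TableSnapshotInputFormat, and deletes the snapshot afterwards. Running it once against each cluster and diffing the two summary outputs gives step 3.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Illustrative driver: run once against each cluster, then compare the two outputs. */
public class SnapshotDigestDriver {

  public static void run(Configuration conf, TableName table, String snapshotName,
      Path restoreDir, Path outputDir) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin()) {
      admin.snapshot(snapshotName, table);                // step 1: snapshot the table
      try {
        Job job = Job.getInstance(conf, "table-digest-" + snapshotName);
        job.setJarByClass(SnapshotDigestDriver.class);
        // Start/stop row, raw scan and time range could be configured on the Scan here.
        Scan scan = new Scan();
        // Step 2: scan the snapshot (restored temporarily under restoreDir), not the live table.
        TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName, scan,
            HashDigestMapper.class, NullWritable.class, Text.class, job,
            true, restoreDir);
        // Map-only: per-task summaries are merged by whoever compares the two outputs.
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, outputDir);
        if (!job.waitForCompletion(true)) {
          throw new IllegalStateException("Digest job failed for " + snapshotName);
        }
      } finally {
        admin.deleteSnapshot(snapshotName);               // step 4: clean up the snapshot
      }
    }
  }
}
{code}

Only the small summary files produced by each cluster would need to cross the network for the comparison, which is where the speed-up over VerifyReplication would come from.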



> HBase should provide a tool to compare 2 remote tables.
> -------------------------------------------------------
>
>                 Key: HBASE-11715
>                 URL: https://issues.apache.org/jira/browse/HBASE-11715
>             Project: HBase
>          Issue Type: New Feature
>          Components: util
>            Reporter: Jean-Marc Spaggiari
>            Priority: Major
>
> As discussed in the mailing list, when a table is copied to another cluster and needs to be validated against the first one, only VerifyReplication can be used. However, this can be very slow since the data needs to be copied again.
> We should provide an easier and faster way to compare the tables. 
> One option is to calculate hashes per range. The user can define a number of buckets; we then split the table into that many buckets and calculate a hash for each (like the partitioner already does). We can also optionally calculate an overall CRC to reduce hash collisions even more.
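
A minimal sketch of the per-range hashing described in the issue above (BucketHashMapper and the digest.bucket.splits configuration key are illustrative, not existing HBase code): each row is assigned to one of N key-range buckets and a per-row hash is emitted for it, so the two clusters only need to exchange N bucket digests.

{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;

/** Illustrative mapper: assigns each row to a key-range bucket and emits a per-row hash. */
public class BucketHashMapper extends TableMapper<IntWritable, LongWritable> {

  private byte[][] splitKeys; // sorted split points defining splitKeys.length + 1 buckets

  @Override
  protected void setup(Context context) {
    // Split points are assumed to be passed by the driver in the job configuration,
    // e.g. as a comma-separated list under an illustrative key.
    String[] parts = context.getConfiguration().getStrings("digest.bucket.splits", new String[0]);
    splitKeys = new byte[parts.length][];
    for (int i = 0; i < parts.length; i++) {
      splitKeys[i] = Bytes.toBytesBinary(parts[i]);
    }
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    context.write(new IntWritable(findBucket(key.copyBytes())), new LongWritable(rowHash(value)));
  }

  /** Index of the first split key greater than the row key, i.e. the row's bucket. */
  private int findBucket(byte[] row) {
    int bucket = 0;
    while (bucket < splitKeys.length && Bytes.compareTo(row, splitKeys[bucket]) >= 0) {
      bucket++;
    }
    return bucket;
  }

  /** Same per-row hash on both clusters so bucket digests are comparable. */
  private long rowHash(Result result) {
    long h = 17;
    for (Cell cell : result.rawCells()) {
      h = 31 * h + Bytes.hashCode(CellUtil.cloneRow(cell));
      h = 31 * h + Bytes.hashCode(CellUtil.cloneFamily(cell));
      h = 31 * h + Bytes.hashCode(CellUtil.cloneQualifier(cell));
      h = 31 * h + Long.hashCode(cell.getTimestamp());
      h = 31 * h + Bytes.hashCode(CellUtil.cloneValue(cell));
    }
    return h;
  }
}
{code}

A reducer would then sum or XOR the per-row hashes within each bucket into one digest per bucket; mismatching buckets pinpoint the key ranges that differ, and an overall CRC of the bucket digests can catch remaining collisions, as the issue description suggests.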



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)