Posted to user@hbase.apache.org by Colin Kincaid Williams <di...@uw.edu> on 2014/08/07 23:00:37 UTC

Large discrepancy in hdfs hbase rootdir size after copytable operation.

I used the CopyTable command to copy a table from the original cluster A
to a new cluster B, and I have noticed that the rootdir on B is more than
2X the size of the original. I am trying to account for such a large
difference. The following are some details about the table.


Could the bloom filter itself account for this?

The guide I used as a reference:
http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
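
The exact CopyTable invocation was never posted in this thread; a typical
run of that era looks roughly like the following, where the ZooKeeper
quorum of cluster B is a made-up placeholder:

  hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
      --peer.adr=zk1,zk2,zk3:2181:/hbase ADMd5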



Supposedly the original command used to create the table on cluster A:

create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW', VERSIONS => '1',
COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}


How I created the target table on cluster B:

create 'ADMd5','a',{
BLOOMFILTER => 'ROW',
VERSIONS => '1',
COMPRESSION => 'SNAPPY',
MIN_VERSIONS => '0',
SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
'/zyuFR1VmhJyF4rbWsFnEg==',
'0sZYnBd83ul58d1O8I2JnA==',
'2+03N7IicZH3ltrqZUX6kQ==',
'4+/slRQtkBDU7Px6C9MAbg==',
'6+1dGCQ/IBrCsrNQXe/9xQ==',
'7+2pvtpHUQHWkZJoouR9wQ==',
'8+4n2deXhzmrpe//2Fo6Fg==',
'9+4SKW/BmNzpL68cXwKV1Q==',
'A+4ajStFkjEMf36cX5D9xg==',
'B+6Zm6Kccb3l6iM2L0epxQ==',
'C+6lKKDiOWl5qrRn72fNCw==',
'D+6dZMyn7m+NhJ7G07gqaw==',
'E+6BrimmrpAd92gZJ5hyMw==',
'G+5tisu4xWZMOJnDHeYBJg==',
'I+7fRy4dvqcM/L6dFRQk9g==',
'J+8ECMw1zeOyjfOg/ypXJA==',
'K+7tenLYn6a1aNLniL6tbg==']}


How the tables now appear in hbase shell:

table A:

describe 'ADMd5'
DESCRIPTION                                                            ENABLED
 {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER => 'NONE',   true
 REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
 MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
 IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

1 row(s) in 0.0370 seconds


table B:

hbase(main):003:0> describe 'ADMd5'
DESCRIPTION                                                            ENABLED
 {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER => 'ROW',    true
 REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY',
 MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
 IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

1 row(s) in 0.0280 seconds



The containing folder size in HDFS:
table A:
sudo -u hdfs hadoop fs -dus -h /a_d
dus: DEPRECATED: Please use 'du -s' instead.
227.4g  /a_d

table B:
sudo -u hdfs hadoop fs -dus -h /a_d
dus: DEPRECATED: Please use 'du -s' instead.
501.0g  /a_d


https://gist.github.com/drocsid/80bba7b6b19d64fde6c2

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Colin Kincaid Williams <di...@uw.edu>.
The correct syntax is:

create 'ADMd5',{
NAME => 'a',
VERSIONS => '1',
COMPRESSION => 'SNAPPY',
BLOOMFILTER => 'ROW',
},
{
SPLITS => ['/++ASUZm4u7YsTcF/VtK6Q==',
'/zyuFR1VmhJyF4rbWsFnEg==',
'0sZYnBd83ul58d1O8I2JnA==',
'2+03N7IicZH3ltrqZUX6kQ==',
'4+/slRQtkBDU7Px6C9MAbg==',
'6+1dGCQ/IBrCsrNQXe/9xQ==',
'7+2pvtpHUQHWkZJoouR9wQ==',
'8+4n2deXhzmrpe//2Fo6Fg==',
'9+4SKW/BmNzpL68cXwKV1Q==',
'A+4ajStFkjEMf36cX5D9xg==',
'B+6Zm6Kccb3l6iM2L0epxQ==',
'C+6lKKDiOWl5qrRn72fNCw==',
'D+6dZMyn7m+NhJ7G07gqaw==',
'E+6BrimmrpAd92gZJ5hyMw==',
'G+5tisu4xWZMOJnDHeYBJg==',
'I+7fRy4dvqcM/L6dFRQk9g==',
'J+8ECMw1zeOyjfOg/ypXJA==',
'K+7tenLYn6a1aNLniL6tbg==',]
}
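
With NAME => 'a' inside the first hash, the family options are applied. A
describe afterwards should now show them (expected values inferred from the
intended schema above, not a captured session):

  describe 'ADMd5'
  ... {NAME => 'a', BLOOMFILTER => 'ROW', VERSIONS => '1',
  COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', ...} ...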



On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <di...@uw.edu>
wrote:

> I have discovered the error. I made the mistake regarding the compression
> and the bloom filter. The new table doesn't have them enabled, and the old
> does. However I'm wondering how I can create tables with splits and bf and
> compression enabled. Shouldn't the following command return an error?
>
> hbase(main):001:0> create 'ADMd5','a',{
>
> hbase(main):002:1* BLOOMFILTER => 'ROW',
> hbase(main):003:1* VERSIONS => '1',
> hbase(main):004:1* COMPRESSION => 'SNAPPY',
> hbase(main):005:1* MIN_VERSIONS => '0',
> hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
> hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
> hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
> hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
> hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
> hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
> hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
> hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
> hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
> hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
> hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
> hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
> hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
> hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
> hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
> hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
> hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
> 0 row(s) in 1.8010 seconds
>
> hbase(main):024:0> describe 'ADMd5'
> DESCRIPTION                                                          ENABLED
>
>  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER => 'NONE', true
>  REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
>  MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
>  IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>
> 1 row(s) in 0.0420 seconds
>
>
>
> On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> Hi Colin,
>>
>> Just to make sure.
>>
>> Is table A on the source cluster and not compressed, and table B on the
>> destination cluster and SNAPPY compressed? Is that correct? Then the
>> ratio should be the opposite. Are you able to run du -h from hadoop to
>> see whether all regions are evenly bigger or whether anything else is
>> wrong?
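
A sketch of that check (assuming the 0.92-era layout, where region
directories sit directly under the table directory in the rootdir, /a_d
here); comparing the listings from both clusters shows whether the extra
space is spread evenly or concentrated in a few regions:

  sudo -u hdfs hadoop fs -du -h /a_d/ADMd5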
>>
>>
>> 2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <di...@uw.edu>:
>>
>> > I haven't yet tried to major compact table B. I will look up some
>> > documentation on WALs and snapshots to find this information in the hdfs
>> > filesystem tomorrow. Could it be caused by the bloomfilter existing on
>> > table B, but not table A? The funny thing is the source table is smaller
>> > than the destination.
>> >
>> >
>> > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <esteban@cloudera.com>
>> > wrote:
>> >
>> > > Hi Colin,
>> > >
>> > > Have you verified whether the content of /a_d includes WALs and/or
>> > > the content of snapshots or the HBase archive? Have you tried to
>> > > major compact table B? Does it make any difference?
>> > >
>> > > regards,
>> > > esteban.
>> > >
>> > >
>> > >
>> > > --
>> > > Cloudera, Inc.
>> > >
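
Two quick checks along those lines (directory names per the 0.92 layout;
later versions keep an archive directory as well, so this is a sketch, not
an exhaustive list):

  # where does the space sit: table data or WALs (.logs, .oldlogs)?
  sudo -u hdfs hadoop fs -du -h /a_d

  # rewrite table B's storefiles, dropping any redundant data:
  echo "major_compact 'ADMd5'" | hbase shell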

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Ted Yu <yu...@gmail.com>.
I used 0.98.5 RC0

FYI


On Fri, Aug 8, 2014 at 1:16 PM, Colin Kincaid Williams <di...@uw.edu>
wrote:

> Not in the hbase shell version I have:
>
> hbase version
> 14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
> 14/08/08 14:16:08 INFO util.VersionInfo: Subversion
>
> file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3
> -r Unknown
> 14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan 26
> 17:11:38 PST 2013
>
>
>
> On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Using a simplified version of your command, I saw the following in the
> > shell output (you may have noticed it as well):
> >
> > An argument ignored (unknown or overridden): BLOOMFILTER
> > An argument ignored (unknown or overridden): VERSIONS
> > 0 row(s) in 2.1110 seconds
> >
> > Cheers
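
The warning is the shell reporting that the hash was parsed as table-scope
arguments, where BLOOMFILTER and VERSIONS are unknown keys. A minimal
illustration of the difference (table and family names here are made up):

  # 'f1' is the family; the hash is table-scope, so BLOOMFILTER is ignored:
  create 't1', 'f1', {BLOOMFILTER => 'ROW'}

  # NAME makes the hash a column-family descriptor, so BLOOMFILTER applies:
  create 't1', {NAME => 'f1', BLOOMFILTER => 'ROW'}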

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hi Colin,

Would it be possible for you to share the command line arguments that you
used to run CopyTable? Also, is your source table static while the
CopyTable command runs (e.g. no Puts or Deletes while the CopyTable job
is running)? Have you seen failures in the tasks even when the job
is successful? And how are you validating that there is missing data after
you run the CopyTable command?

thanks,
esteban.




--
Cloudera, Inc.
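
A first-order validation that is often used is the stock RowCounter job,
run against each cluster so the ROWS counters can be compared:

  hbase org.apache.hadoop.hbase.mapreduce.RowCounter ADMd5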



On Tue, Aug 12, 2014 at 7:01 PM, tobe <to...@gmail.com> wrote:

> Hi Colin,
>
> Does your table contain some really large rows?
>
> I ran into some errors when copying a table whose rows have 400K columns.
> I have not verified the content, but I'm shocked to hear that you are
> missing data with CopyTable.
>
>
> On Wed, Aug 13, 2014 at 9:00 AM, Colin Kincaid Williams <di...@uw.edu>
> wrote:
>
> > It appears that there is a bug in the copytable operation. We are
> > missing a large amount of data after copying between clusters. I don't
> > know if I can provide a sample data set or not, but I can try to dig up
> > some details. One of our developers rewrote the operation using another
> > library, and is testing his copy now.
> >
> >
> > On Sun, Aug 10, 2014 at 11:24 PM, anil gupta <an...@gmail.com>
> > wrote:
> >
> > > Hi Colin,
> > >
> > > We also faced this scenario: after copying table "A" from cluster 1
> > > to cluster 2, the sizes of the HDFS files on the two clusters were
> > > not equal. We also assumed they should be equal, hence we ran a
> > > verifyRep job. I don't know the reason behind this discrepancy, but I
> > > just wanted to share this so that you are aware that you are not the
> > > only one facing it.
> > >
> > > ~Anil
> > >
> > >
> > > On Sun, Aug 10, 2014 at 2:21 PM, Colin Kincaid Williams <
> discord@uw.edu>
> > > wrote:
> > >
> > > > By the way, I have copied the table across clusters with the tables
> > > > configured the same. The source cluster has an underlying ext2
> > > > filesystem, while the dest cluster has an underlying ext4 filesystem.
> > > > The counts are the same for the tables. Will the filesystem account
> > > > for the difference in directory size?
> > > >
> > > > [root@clusterA_ext2 ~]# sudo -u hdfs hadoop fs -dus -h /a_d/
> > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > > 225.9g  /a_d
> > > >
> > > >
> > > > [root@clusterB_ext4 ~]#  sudo -u hdfs hadoop fs -dus -h /a_d/
> > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > > 172.8g  /a_d
> > > >
> > > >
> > > >
> > > > On Sun, Aug 10, 2014 at 4:17 AM, Jean-Marc Spaggiari <
> > > > jean-marc@spaggiari.org> wrote:
> > > >
> > > > > HBASE-11715 <https://issues.apache.org/jira/browse/HBASE-11715> opened.
> > > > >
> > > > >
> > > > > 2014-08-10 7:12 GMT-04:00 Jean-Marc Spaggiari <jean-marc@spaggiari.org>:
> > > > >
> > > > > > +1 too for a tool to produce a hash of a table. Like, one hash
> > > > > > per region, or as Lars said, one hash per range. You define the
> > > > > > number of buckets you want and run the MR job, which produces a
> > > > > > list of hashes, and you compare that list between the 2 clusters.
> > > > > > Might be pretty simple to do. The more buckets you define, the
> > > > > > less risk you have of a hash collision. We can even have a global
> > > > > > hash and one hash per bucket, and other options...
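
A rough sketch of the per-bucket hash idea, runnable from 'hbase shell'
(which is JRuby) without waiting for a dedicated tool. The key range below
is one bucket, bounded by two of the split points from this thread;
nothing here has been tested against 0.92, so treat it as illustrative:

  require 'digest/md5'

  conf  = org.apache.hadoop.hbase.HBaseConfiguration.create
  table = org.apache.hadoop.hbase.client.HTable.new(conf, 'ADMd5')
  scan  = org.apache.hadoop.hbase.client.Scan.new(
            '2+03N7IicZH3ltrqZUX6kQ=='.to_java_bytes,
            '4+/slRQtkBDU7Px6C9MAbg=='.to_java_bytes)
  md5     = Digest::MD5.new
  scanner = table.getScanner(scan)
  while (result = scanner.next)
    result.raw.each do |kv|
      md5 << String.from_java_bytes(kv.getKey)    # row/family/qualifier/ts
      md5 << String.from_java_bytes(kv.getValue)
    end
  end
  scanner.close
  table.close
  puts md5.hexdigest    # compare this digest between the two clusters

Since CopyTable preserves cell timestamps, matching data should produce
matching digests.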
> > > > > >
> > > > > >
> > > > > > 2014-08-10 1:59 GMT-04:00 anil gupta <an...@gmail.com>:
> > > > > >
> > > > > >> +1 for a MerkleTree or range-hash based implementation. We had
> > > > > >> a table with 1 billion records. We ran verifyRep for that table
> > > > > >> across two data centers and it took close to 1 week to finish.
> > > > > >> It seems that at present, VerifyRep compares every row byte by
> > > > > >> byte.
> > > > > >>
> > > > > >>
> > > > > >> On Sat, Aug 9, 2014 at 6:11 PM, lars hofhansl <larsh@apache.org> wrote:
> > > > > >>
> > > > > >> > VerifyReplication is something you could use. It's not
> > > > > >> > replication specific, just named that way because it was
> > > > > >> > initially conceived as a tool to verify that replication is
> > > > > >> > working correctly. Unfortunately it will need to ship all data
> > > > > >> > from the remote cluster, which is quite inefficient.
> > > > > >> > I think we should include a better way with HBase, maybe using
> > > > > >> > Merkle trees, or at least hashes of ranges, and compare those.
> > > > > >> >
> > > > > >> > -- Lars
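
For reference, the stock invocation (it presumes a replication peer, here
with id '1', already configured to point at the other cluster, which is
part of the setup cost):

  hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 1 ADMd5

The job reports GOODROWS and BADROWS counters when it finishes.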
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > ________________________________
> > > > > >> >  From: Colin Kincaid Williams <di...@uw.edu>
> > > > > >> > To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > > > > >> > Sent: Saturday, August 9, 2014 2:28 PM
> > > > > >> > Subject: Re: Large discrepancy in hdfs hbase rootdir size
> > > > > >> > after copytable operation.
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > Hi Everybody,
> > > > > >> >
> > > > > >> > I do wish to upgrade to a more recent hbase soon. However,
> > > > > >> > the choice isn't entirely mine. Does anybody know how to
> > > > > >> > verify the contents between tables across clusters after a
> > > > > >> > copytable operation?
> > > > > >> > I see replication.VerifyReplication, but that seems
> > > > > >> > replication specific.
> > > > > >> > Maybe I should have begun with replication in the first
> > > > > >> > place...
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <larsh@apache.org> wrote:
> > > > > >> >
> > > > > >> > Hi Colin,
> > > > > >> > >
> > > > > >> > >you might want to consider upgrading. The current stable
> > > > > >> > >version is 0.98.4 (soon .5).
> > > > > >> > >
> > > > > >> > >Even just going to 0.94 will give a lot of new features,
> > > > > >> > >stability, and performance.
> > > > > >> > >0.92.x can be upgraded to 0.94.x without any downtime and
> > > > > >> > >without any upgrade steps necessary.
> > > > > >> > >For an upgrade to 0.98 and later you'd need some downtime
> > > > > >> > >and also execute an upgrade step.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >-- Lars
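
The upgrade step lars refers to is the 'hbase upgrade' tool from the docs
of that era (a sketch: -check runs against the live old cluster, -execute
runs with the new binaries after shutdown):

  bin/hbase upgrade -check
  bin/hbase upgrade -execute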

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by tobe <to...@gmail.com>.
Hi Colin,

Does your table contain some really large rows?

I ran into some errors when copying a table whose rows have 400K columns.
I have not verified the content, but I'm shocked to hear that you are
missing data with CopyTable.
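
Rows that wide can exhaust client memory during a full-row copy. With the
Java/JRuby client, a scan can be asked for partial rows instead (a sketch,
untested against the versions in this thread):

  scan = org.apache.hadoop.hbase.client.Scan.new
  scan.setBatch(1000)   # return at most 1000 cells of a row per next() call
  scan.setCaching(10)   # and fetch only a few rows per RPC round trip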



Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Colin Kincaid Williams <di...@uw.edu>.
It appears that there is a bug in the copytable operation. We are missing a
large amount of data after copying between clusters. I don't know whether I
can provide a sample data set, but I can try to dig up some details. One of
our developers rewrote the operation using another library, and is testing
his copy now.
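
For anyone trying to pin this down, the stock RowCounter job gives a cheap
row-level comparison between the two copies (a first check only; it says
nothing about cell contents):

hbase org.apache.hadoop.hbase.mapreduce.RowCounter ADMd5

Run it on each cluster and compare the row counts reported in the job
counters. Matching counts don't rule out missing or differing cells, which
is why a hash-based comparison comes up elsewhere in this thread.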


On Sun, Aug 10, 2014 at 11:24 PM, anil gupta <an...@gmail.com> wrote:

> Hi Colin,
>
> We also faced this scenario: after copying table "A" from cluster 1 to
> cluster 2, the size of the HDFS files between the clusters was not equal.
> We had assumed it would be equal, hence we ran a verifyRep job.
> I don't know what's behind this discrepancy, but I just wanted to share
> this so that you are aware that you are not the only one facing this.
>
> ~Anil
>
>
> On Sun, Aug 10, 2014 at 2:21 PM, Colin Kincaid Williams <di...@uw.edu>
> wrote:
>
> > By the way, I have copied the table across clusters, with the tables
> > configured the same. The source cluster has an underlying ext2
> > filesystem, while the destination cluster has an underlying ext4
> > filesystem. The counts are the same for the tables. Will the filesystem
> > account for the difference in directory size?
> >
> > [root@clusterA_ext2 ~]# sudo -u hdfs hadoop fs -dus -h /a_d/
> > dus: DEPRECATED: Please use 'du -s' instead.
> > 225.9g  /a_d
> >
> >
> > [root@clusterB_ext4 ~]#  sudo -u hdfs hadoop fs -dus -h /a_d/
> > dus: DEPRECATED: Please use 'du -s' instead.
> > 172.8g  /a_d
> >
> >
> >
> > On Sun, Aug 10, 2014 at 4:17 AM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > HBASE-11715 <https://issues.apache.org/jira/browse/HBASE-11715> opened.
> > >
> > >
> > > 2014-08-10 7:12 GMT-04:00 Jean-Marc Spaggiari <jean-marc@spaggiari.org>:
> > >
> > > > +1 too for a tool to produce a hash of a table. Like, one hash per
> > > > region, or as Lars said, one hash per range. You define the number of
> > > > buckets you want, run the MR job, which produces a list of hashes, and
> > > > compare that from the 2 clusters. Might be pretty simple to do. The
> > > > more buckets you define, the lower the risk of a hash collision. We
> > > > can even have a global hash and one hash per bucket, and other
> > > > options...
> > > >
> > > >
> > > > 2014-08-10 1:59 GMT-04:00 anil gupta <an...@gmail.com>:
> > > >
> > > >> +1 for a MerkleTree or range-hash based implementation. We had a
> > > >> table with 1 billion records. We ran verifyRep for that table across
> > > >> two data centers and it took close to 1 week to finish. It seems that
> > > >> at present VerifyRep compares every row byte by byte.
> > > >>
> > > >>
> > > >> On Sat, Aug 9, 2014 at 6:11 PM, lars hofhansl <la...@apache.org>
> > wrote:
> > > >>
> > > >> > VerifyReplication is something you could use. It's not replication
> > > >> > specific, just named that way because it was initially conceived as
> > > >> > a tool to verify that replication is working correctly.
> > > >> > Unfortunately it will need to ship all data from the remote
> > > >> > cluster, which is quite inefficient.
> > > >> > I think we should include a better way with HBase, maybe using
> > > >> > Merkle trees, or at least hashes of ranges, and compare those.
> > > >> >
> > > >> > -- Lars
> > > >> >
> > > >> >
> > > >> >
> > > >> > ________________________________
> > > >> >  From: Colin Kincaid Williams <di...@uw.edu>
> > > >> > To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > > >> > Sent: Saturday, August 9, 2014 2:28 PM
> > > >> > Subject: Re: Large discrepancy in hdfs hbase rootdir size after
> > > >> copytable
> > > >> > operation.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Hi Everybody,
> > > >> >
> > > >> > I do wish to upgrade to a more recent HBase soon. However, the
> > > >> > choice isn't entirely mine. Does anybody know how to verify the
> > > >> > contents between tables across clusters after a copytable
> > > >> > operation?
> > > >> > I see replication.VerifyReplication, but that seems replication
> > > >> > specific. Maybe I should have begun with replication in the first
> > > >> > place...
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <la...@apache.org>
> > > wrote:
> > > >> >
> > > >> > Hi Colin,
> > > >> > >
> > > >> > >you might want to consider upgrading. The current stable version
> > > >> > >is 0.98.4 (soon .5).
> > > >> > >
> > > >> > >Even just going to 0.94 will give a lot of new features,
> > > >> > >stability, and performance.
> > > >> > >0.92.x can be upgraded to 0.94.x without any downtime and without
> > > >> > >any upgrade steps necessary.
> > > >> > >For an upgrade to 0.98 and later you'd need some downtime and
> > > >> > >would also have to execute an upgrade step.
> > > >> > >
> > > >> > >
> > > >> > >-- Lars
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >----- Original Message -----
> > > >> > >From: Colin Kincaid Williams <di...@uw.edu>
> > > >> > >To: user@hbase.apache.org
> > > >> > >Cc:
> > > >> > >Sent: Friday, August 8, 2014 1:16 PM
> > > >> > >Subject: Re: Large discrepancy in hdfs hbase rootdir size after
> > > >> copytable
> > > >> > operation.
> > > >> > >
> > > >> > >Not in the hbase shell I have:
> > > >> > >
> > > >> > >hbase version
> > > >> > >14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
> > > >> > >14/08/08 14:16:08 INFO util.VersionInfo: Subversion
> > > >> > >file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3
> > > >> > >-r Unknown
> > > >> > >14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on
> Sat
> > > >> Jan 26
> > > >> > >17:11:38 PST 2013
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > >> > >
> > > >> > >> Using a simplified version of your command, I saw the following
> > > >> > >> in the shell output (you may have noticed it as well):
> > > >> > >>
> > > >> > >> An argument ignored (unknown or overridden): BLOOMFILTER
> > > >> > >> An argument ignored (unknown or overridden): VERSIONS
> > > >> > >> 0 row(s) in 2.1110 seconds
> > > >> > >>
> > > >> > >> Cheers
> > > >> > >>
> > > >> > >>
> > > >> > >> On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <
> > > >> discord@uw.edu
> > > >> > >
> > > >> > >> wrote:
> > > >> > >>
> > > >> > >> > I have discovered the error. I made the mistake regarding the
> > > >> > >> > compression and the bloom filter. The new table doesn't have
> > > >> > >> > them enabled, and the old one does. However, I'm wondering how
> > > >> > >> > I can create tables with splits and bf and compression
> > > >> > >> > enabled. Shouldn't the following command return an error?
> > > >> > >> >
> > > >> > >> > hbase(main):001:0> create 'ADMd5','a',{
> > > >> > >> >
> > > >> > >> > hbase(main):002:1* BLOOMFILTER => 'ROW',
> > > >> > >> > hbase(main):003:1* VERSIONS => '1',
> > > >> > >> > hbase(main):004:1* COMPRESSION => 'SNAPPY',
> > > >> > >> > hbase(main):005:1* MIN_VERSIONS => '0',
> > > >> > >> > hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > > >> > >> > hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
> > > >> > >> > hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
> > > >> > >> > hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
> > > >> > >> > hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
> > > >> > >> > hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > > >> > >> > hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
> > > >> > >> > hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
> > > >> > >> > hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
> > > >> > >> > hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
> > > >> > >> > hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > > >> > >> > hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
> > > >> > >> > hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
> > > >> > >> > hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
> > > >> > >> > hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
> > > >> > >> > hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
> > > >> > >> > hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
> > > >> > >> > hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
> > > >> > >> > 0 row(s) in 1.8010 seconds
> > > >> > >> >
> > > >> > >> > hbase(main):024:0> describe 'ADMd5'
> > > >> > >> > DESCRIPTION                                        ENABLED
> > > >> > >> >
> > > >> > >> >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOO true
> > > >> > >> >
> > > >> > >> >  MFILTER => 'NONE', REPLICATION_SCOPE => '0', VERS
> > > >> > >> >
> > > >> > >> >  IONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS
> > > >> > >> >
> > > >> > >> >  => '0', TTL => '2147483647', BLOCKSIZE => '65536'
> > > >> > >> >
> > > >> > >> >  , IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
> > > >> > >> >
> > > >> > >> > 1 row(s) in 0.0420 seconds
> > > >> > >> >
> > > >> > >> >
> > > >> > >> >
> > > >> > >> > On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <
> > > >> > >> > jean-marc@spaggiari.org
> > > >> > >> > > wrote:
> > > >> > >> >
> > > >> > >> > > Hi Colin,
> > > >> > >> > >
> > > >> > >> > > Just to make sure.
> > > >> > >> > >
> > > >> > >> > > Is table A from the source cluster and not compressed, and
> > > >> > >> > > table B in the destination cluster and SNAPPY compressed? Is
> > > >> > >> > > that correct? If so, the ratio should be the opposite. Are
> > > >> > >> > > you able to du -h from hadoop to see if all regions are
> > > >> > >> > > evenly bigger or if anything else is wrong?
> > > >> > >> > >
> > > >> > >> > >
> > > >> > >> > > 2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <
> > > >> discord@uw.edu>:
> > > >> > >> > >
> > > >> > >> > > > I haven't yet tried to major compact table B. I will look
> > > >> > >> > > > up some documentation on WALs and snapshots to find this
> > > >> > >> > > > information in the hdfs filesystem tomorrow. Could it be
> > > >> > >> > > > caused by the bloomfilter existing on table B, but not
> > > >> > >> > > > table A? The funny thing is the source table is smaller
> > > >> > >> > > > than the destination.
> > > >> > >> > > >
> > > >> > >> > > >
> > > >> > >> > > > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <
> > > >> > >> > esteban@cloudera.com>
> > > >> > >> > > > wrote:
> > > >> > >> > > >
> > > >> > >> > > > > Hi Colin,
> > > >> > >> > > > >
> > > >> > >> > > > > Have you verified whether the content of /a_d includes
> > > >> > >> > > > > WALs and/or the content of the snapshots or the HBase
> > > >> > >> > > > > archive? Have you tried to major compact table B? Does
> > > >> > >> > > > > it make any difference?
> > > >> > >> > > > >
> > > >> > >> > > > > regards,
> > > >> > >> > > > > esteban.
> > > >> > >> > > > >
> > > >> > >> > > > >
> > > >> > >> > > > >
> > > >> > >> > > > > --
> > > >> > >> > > > > Cloudera, Inc.
> > > >> > >> > > > >
> > > >> > >> > > > >
> > > >> > >> > > > >
> > > >> > >> > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams
> <
> > > >> > >> > discord@uw.edu
> > > >> > >> > > >
> > > >> > >> > > > > wrote:
> > > >> > >> > > > >
> > > >> > >> > > > > > I used the copy table command to copy a database
> > between
> > > >> the
> > > >> > >> > original
> > > >> > >> > > > > > cluster A and a new cluster B. I have noticed that
> the
> > > >> > rootdir is
> > > >> > >> > > > larger
> > > >> > >> > > > > > than 2X the size of the original. I am trying to
> > account
> > > >> for
> > > >> > >> such a
> > > >> > >> > > > large
> > > >> > >> > > > > > difference. The following are some details about the
> > > table.
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > I'm trying to figure out why my copied table is more
> > than
> > > >> 2X
> > > >> > the
> > > >> > >> > size
> > > >> > >> > > > of
> > > >> > >> > > > > > the original table. Could the bloomfilter itself
> > account
> > > >> for
> > > >> > >> this?
> > > >> > >> > > > > >
> > > >> > >> > > > > > The guide I used as a reference:
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > >
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> >
> > > >>
> > >
> >
> http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > Supposedly the original command used to create the
> > table
> > > on
> > > >> > >> cluster
> > > >> > >> > > A:
> > > >> > >> > > > > >
> > > >> > >> > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW',
> > > >> VERSIONS
> > > >> > =>
> > > >> > >> > '1',
> > > >> > >> > > > > > COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > How I created the target table on cluster B:
> > > >> > >> > > > > >
> > > >> > >> > > > > > create 'ADMd5','a',{
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > BLOOMFILTER => 'ROW',
> > > >> > >> > > > > > VERSIONS => '1',
> > > >> > >> > > > > > COMPRESSION => 'SNAPPY',
> > > >> > >> > > > > > MIN_VERSIONS => '0',
> > > >> > >> > > > > > SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > > >> > >> > > > > > '/zyuFR1VmhJyF4rbWsFnEg==',
> > > >> > >> > > > > > '0sZYnBd83ul58d1O8I2JnA==',
> > > >> > >> > > > > > '2+03N7IicZH3ltrqZUX6kQ==',
> > > >> > >> > > > > > '4+/slRQtkBDU7Px6C9MAbg==',
> > > >> > >> > > > > > '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > > >> > >> > > > > > '7+2pvtpHUQHWkZJoouR9wQ==',
> > > >> > >> > > > > > '8+4n2deXhzmrpe//2Fo6Fg==',
> > > >> > >> > > > > > '9+4SKW/BmNzpL68cXwKV1Q==',
> > > >> > >> > > > > > 'A+4ajStFkjEMf36cX5D9xg==',
> > > >> > >> > > > > > 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > > >> > >> > > > > > 'C+6lKKDiOWl5qrRn72fNCw==',
> > > >> > >> > > > > > 'D+6dZMyn7m+NhJ7G07gqaw==',
> > > >> > >> > > > > > 'E+6BrimmrpAd92gZJ5hyMw==',
> > > >> > >> > > > > > 'G+5tisu4xWZMOJnDHeYBJg==',
> > > >> > >> > > > > > 'I+7fRy4dvqcM/L6dFRQk9g==',
> > > >> > >> > > > > > 'J+8ECMw1zeOyjfOg/ypXJA==',
> > > >> > >> > > > > > 'K+7tenLYn6a1aNLniL6tbg==']}
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > How the tables now appear in hbase shell:
> > > >> > >> > > > > >
> > > >> > >> > > > > > table A:
> > > >> > >> > > > > >
> > > >> > >> > > > > > describe 'ADMd5'
> > > >> > >> > > > > > DESCRIPTION
> > > >> > >> > > > > >
> > > >> > >> > > > > >   ENABLED
> > > >> > >> > > > > >
> > > >> > >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a',
> > BLOOMFILTER
> > > >> =>
> > > >> > >> > 'NONE',
> > > >> > >> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '3',
> COMPRESSION
> > =>
> > > >> > 'NONE',
> > > >> > >> > > > MIN_VER
> > > >> > >> > > > > > true
> > > >> > >> > > > > >
> > > >> > >> > > > > >  SIONS => '0', TTL => '2147483647', BLOCKSIZE =>
> > '65536',
> > > >> > >> IN_MEMORY
> > > >> > >> > > =>
> > > >> > >> > > > > > 'false', BLOCKCACHE => 'true'}]}
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > 1 row(s) in 0.0370 seconds
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > table B:
> > > >> > >> > > > > >
> > > >> > >> > > > > > hbase(main):003:0> describe 'ADMd5'
> > > >> > >> > > > > > DESCRIPTION
> > > >> > >> > > > > >
> > > >> > >> > > > > >   ENABLED
> > > >> > >> > > > > >
> > > >> > >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a',
> > BLOOMFILTER
> > > >> =>
> > > >> > >> 'ROW',
> > > >> > >> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '1',
> COMPRESSION
> > =>
> > > >> > >> 'SNAPPY',
> > > >> > >> > > > > MIN_VE
> > > >> > >> > > > > > true
> > > >> > >> > > > > >
> > > >> > >> > > > > >  RSIONS => '0', TTL => '2147483647', BLOCKSIZE =>
> > > '65536',
> > > >> > >> > IN_MEMORY
> > > >> > >> > > =>
> > > >> > >> > > > > > 'false', BLOCKCACHE => 'true'}]}
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > 1 row(s) in 0.0280 seconds
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > The containing foldersize in hdfs:
> > > >> > >> > > > > > table A:
> > > >> > >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > > >> > >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > >> > >> > > > > > 227.4g  /a_d
> > > >> > >> > > > > >
> > > >> > >> > > > > > table B:
> > > >> > >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > > >> > >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > >> > >> > > > > > 501.0g  /a_d
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
> > > >> > >> > > > > >
> > > >> > >> > > > >
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> > >
> > > >> > >
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Thanks & Regards,
> > > >> Anil Gupta
> > > >>
> > > >
> > > >
> > >
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by anil gupta <an...@gmail.com>.
Hi Colin,

We also faced this scenario: after copying table "A" from cluster 1 to
cluster 2, the size of the HDFS files between the clusters was not equal.
We had assumed it would be equal, hence we ran a verifyRep job.
I don't know what's behind this discrepancy, but I just wanted to share
this so that you are aware that you are not the only one facing this.
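
For anyone who wants to run the same check, the invocation looks roughly
like this (a sketch only: VerifyReplication requires a replication peer to
be configured for the table, and the peer id "1" here is a placeholder):

hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --families=a 1 ADMd5

Mismatched rows show up as BADROWS (versus GOODROWS) in the MapReduce job
counters.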

~Anil


-- 
Thanks & Regards,
Anil Gupta

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Colin Kincaid Williams <di...@uw.edu>.
By the way, I have copied the table across clusters, with the tables
configured the same. The source cluster has an underlying ext2 filesystem,
while the destination cluster has an underlying ext4 filesystem. The counts
are the same for the tables. Will the filesystem account for the difference
in directory size?

[root@clusterA_ext2 ~]# sudo -u hdfs hadoop fs -dus -h /a_d/
dus: DEPRECATED: Please use 'du -s' instead.
225.9g  /a_d


[root@clusterB_ext4 ~]#  sudo -u hdfs hadoop fs -dus -h /a_d/
dus: DEPRECATED: Please use 'du -s' instead.
172.8g  /a_d
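
As I understand it, hadoop fs -du reports HDFS file lengths rather than
local disk usage, so the ext2/ext4 difference should not change these
numbers. To check whether individual regions differ, as Jean-Marc
suggested earlier, something like this should list the per-region
directory sizes (assuming /a_d is the hbase.rootdir, as in the listings
above):

sudo -u hdfs hadoop fs -du -h /a_d/ADMd5

Each line is one region directory, so a few outsized regions versus evenly
larger ones should stand out.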




Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
HBASE-11715 <https://issues.apache.org/jira/browse/HBASE-11715> opened.


2014-08-10 7:12 GMT-04:00 Jean-Marc Spaggiari <je...@spaggiari.org>:

> +1 too for a tool to produce a hash of a table. Like, one hash per region,
> or as Lars said, one hash per range. You define the number of buckets you
> want, run the MR job, which produce a list of hash, and compare that from
> the 2 clusters. Might be pretty simple to do. The more buckets you define,
> the less risk you have to have a hash collision. We can even have a global
> hash and one hash per bucket, and other options...
>
>
> 2014-08-10 1:59 GMT-04:00 anil gupta <an...@gmail.com>:
>
> +1 for MerkleTree or Range Hash based implementation. We had a table with 1
>> Billion records. We ran verifyRep for that table across two Data Centers
>> and it took close to 1 week to finish. It seems at present, VerifyRep
>> comapres every row byte by byte.
>>
>>
>> On Sat, Aug 9, 2014 at 6:11 PM, lars hofhansl <la...@apache.org> wrote:
>>
>> > VerifyReplication is something you could use. It's not replication
>> > specific, just named that way because it was initially conceived as a
>> tool
>> > to verify that replication is working correctly. Unfortunately it will
>> need
>> > to ship all data from the remote cluster, which is quite inefficient.
>> > I think we should include a better way with HBase, maybe using
>> > Merkletrees, or at least hashes of ranges, and compare those.
>> >
>> > -- Lars
>> >
>> >
>> >
>> > ________________________________
>> >  From: Colin Kincaid Williams <di...@uw.edu>
>> > To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>> > Sent: Saturday, August 9, 2014 2:28 PM
>> > Subject: Re: Large discrepancy in hdfs hbase rootdir size after
>> copytable
>> > operation.
>> >
>> >
>> >
>> > Hi Everybody,
>> >
>> > I do wish to upgrade to a more recent hbase soon. However the choice
>> isn't
>> > entirely mine. Does anybody know how to verify the contents between
>> tables
>> > across clusters after a copytable operation?
>> > I see replication.VerifyReplication , but that seems replication
>> specific.
>> > Maybe I should have began with replication in the first place...
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <la...@apache.org> wrote:
>> >
>> > Hi Colin,
>> > >
>> > >you might want to consider upgrading. The current stable version is
>> > 0.98.4 (soon .5).
>> > >
>> > >Even just going to 0.94 will give a lot of new features, stability, and
>> > performance.
>> > >0.92.x can be upgraded to 0.94.x without any downtime and without any
>> > upgrade steps necessary.
>> > >For an upgrade to 0.98 and later you'd need some downtime and also
>> excute
>> > an upgrade step.
>> > >
>> > >
>> > >-- Lars
>> > >
>> > >
>> > >
>> > >
>> > >----- Original Message -----
>> > >From: Colin Kincaid Williams <di...@uw.edu>
>> > >To: user@hbase.apache.org
>> > >Cc:
>> > >Sent: Friday, August 8, 2014 1:16 PM
>> > >Subject: Re: Large discrepancy in hdfs hbase rootdir size after
>> copytable
>> > operation.
>> > >
>> > >Not in the hbase shell I have:
>> > >
>> > >hbase version
>> > >14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
>> > >14/08/08 14:16:08 INFO util.VersionInfo: Subversion
>> >
>> >
>> >file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3
>> > >-r Unknown
>> > >14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat
>> Jan 26
>> > >17:11:38 PST 2013
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yu...@gmail.com> wrote:
>> > >
>> > >> Using simplified version of your command, I saw the following in
>> shell
>> > >> output (you may have noticed as well):
>> > >>
>> > >> An argument ignored (unknown or overridden): BLOOMFILTER
>> > >> An argument ignored (unknown or overridden): VERSIONS
>> > >> 0 row(s) in 2.1110 seconds
>> > >>
>> > >> Cheers
>> > >>
>> > >>
>> > >> On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <
>> discord@uw.edu
>> > >
>> > >> wrote:
>> > >>
>> > >> > I have discovered the error. I made the mistake regarding the
>> > compression
>> > >> > and the bloom filter. The new table doesn't have them enabled, and
>> the
>> > >> old
>> > >> > does. However I'm wondering how I can create tables with splits
>> and bf
>> > >> and
>> > >> > compression enabled. Shouldn't the following command return an
>> error?
>> > >> >
>> > >> > hbase(main):001:0> create 'ADMd5','a',{
>> > >> >
>> > >> > hbase(main):002:1* BLOOMFILTER => 'ROW',
>> > >> > hbase(main):003:1* VERSIONS => '1',
>> > >> > hbase(main):004:1* COMPRESSION => 'SNAPPY',
>> > >> > hbase(main):005:1* MIN_VERSIONS => '0',
>> > >> > hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
>> > >> > hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
>> > >> > hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
>> > >> > hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
>> > >> > hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
>> > >> > hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
>> > >> > hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
>> > >> > hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
>> > >> > hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
>> > >> > hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
>> > >> > hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
>> > >> > hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
>> > >> > hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
>> > >> > hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
>> > >> > hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
>> > >> > hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
>> > >> > hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
>> > >> > hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
>> > >> > 0 row(s) in 1.8010 seconds
>> > >> >
>> > >> > hbase(main):024:0> describe 'ADMd5'
>> > >> > DESCRIPTION                                        ENABLED
>> > >> >
>> > >> >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOO true
>> > >> >
>> > >> >  MFILTER => 'NONE', REPLICATION_SCOPE => '0', VERS
>> > >> >
>> > >> >  IONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS
>> > >> >
>> > >> >  => '0', TTL => '2147483647', BLOCKSIZE => '65536'
>> > >> >
>> > >> >  , IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>> > >> >
>> > >> > 1 row(s) in 0.0420 seconds
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <
>> > >> > jean-marc@spaggiari.org
>> > >> > > wrote:
>> > >> >
>> > >> > > Hi Colin,
>> > >> > >
>> > >> > > Just to make sure.
>> > >> > >
>> > >> > > Is table A from the source cluster and not compressed, and table
>> B
>> > in
>> > >> the
>> > >> > > destination cluster and SNAPPY compressed? Is that correct? Then
>> > ratio
>> > >> > > should be the opposite. Are you able to du -h from hadoop to see
>> if
>> > all
>> > >> > > regions are evenly bigger or if anything else is wrong?
>> > >> > >
>> > >> > >
>> > >> > > 2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <
>> discord@uw.edu>:
>> > >> > >
>> > >> > > > I haven't yet tried to major compact table B. I will look up
>> some
>> > >> > > > documentation on WALs and snapshots to find this information in
>> > the
>> > >> > hdfs
>> > >> > > > filesystem tomorrow. Could it be caused by the bloomfilter
>> > existing
>> > >> on
>> > >> > > > table B, but not table A? The funny thing is the source table
>> is
>> > >> > smaller
>> > >> > > > than the destination.
>> > >> > > >
>> > >> > > >
>> > >> > > > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <
>> > >> > esteban@cloudera.com>
>> > >> > > > wrote:
>> > >> > > >
>> > >> > > > > Hi Colin,
>> > >> > > > >
>> > >> > > > > Have you verified if the content of /a_d includes WALs and/or
>> > the
>> > >> > > content
>> > >> > > > > of the snapshots or the HBase archive? have you tried to
>> major
>> > >> > compact
>> > >> > > > > table B?  does it make any difference?
>> > >> > > > >
>> > >> > > > > regards,
>> > >> > > > > esteban.
>> > >> > > > >
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > --
>> > >> > > > > Cloudera, Inc.
>> > >> > > > >
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams <
>> > >> > discord@uw.edu
>> > >> > > >
>> > >> > > > > wrote:
>> > >> > > > >
>> > >> > > > > > I used the copy table command to copy a database between
>> the
>> > >> > original
>> > >> > > > > > cluster A and a new cluster B. I have noticed that the
>> > rootdir is
>> > >> > > > larger
>> > >> > > > > > than 2X the size of the original. I am trying to account
>> for
>> > >> such a
>> > >> > > > large
>> > >> > > > > > difference. The following are some details about the table.
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > I'm trying to figure out why my copied table is more than
>> 2X
>> > the
>> > >> > size
>> > >> > > > of
>> > >> > > > > > the original table. Could the bloomfilter itself account
>> for
>> > >> this?
>> > >> > > > > >
>> > >> > > > > > The guide I used as a reference:
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > Supposedly the original command used to create the table on
>> > >> cluster
>> > >> > > A:
>> > >> > > > > >
>> > >> > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW',
>> VERSIONS
>> > =>
>> > >> > '1',
>> > >> > > > > > COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > How I created the target table on cluster B:
>> > >> > > > > >
>> > >> > > > > > create 'ADMd5','a',{
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > BLOOMFILTER => 'ROW',
>> > >> > > > > > VERSIONS => '1',
>> > >> > > > > > COMPRESSION => 'SNAPPY',
>> > >> > > > > > MIN_VERSIONS => '0',
>> > >> > > > > > SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
>> > >> > > > > > '/zyuFR1VmhJyF4rbWsFnEg==',
>> > >> > > > > > '0sZYnBd83ul58d1O8I2JnA==',
>> > >> > > > > > '2+03N7IicZH3ltrqZUX6kQ==',
>> > >> > > > > > '4+/slRQtkBDU7Px6C9MAbg==',
>> > >> > > > > > '6+1dGCQ/IBrCsrNQXe/9xQ==',
>> > >> > > > > > '7+2pvtpHUQHWkZJoouR9wQ==',
>> > >> > > > > > '8+4n2deXhzmrpe//2Fo6Fg==',
>> > >> > > > > > '9+4SKW/BmNzpL68cXwKV1Q==',
>> > >> > > > > > 'A+4ajStFkjEMf36cX5D9xg==',
>> > >> > > > > > 'B+6Zm6Kccb3l6iM2L0epxQ==',
>> > >> > > > > > 'C+6lKKDiOWl5qrRn72fNCw==',
>> > >> > > > > > 'D+6dZMyn7m+NhJ7G07gqaw==',
>> > >> > > > > > 'E+6BrimmrpAd92gZJ5hyMw==',
>> > >> > > > > > 'G+5tisu4xWZMOJnDHeYBJg==',
>> > >> > > > > > 'I+7fRy4dvqcM/L6dFRQk9g==',
>> > >> > > > > > 'J+8ECMw1zeOyjfOg/ypXJA==',
>> > >> > > > > > 'K+7tenLYn6a1aNLniL6tbg==']}
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > How the tables now appear in hbase shell:
>> > >> > > > > >
>> > >> > > > > > table A:
>> > >> > > > > >
>> > >> > > > > > describe 'ADMd5'
>> > >> > > > > > DESCRIPTION
>> > >> > > > > >
>> > >> > > > > >   ENABLED
>> > >> > > > > >
>> > >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER
>> =>
>> > >> > 'NONE',
>> > >> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION =>
>> > 'NONE',
>> > >> > > > MIN_VER
>> > >> > > > > > true
>> > >> > > > > >
>> > >> > > > > >  SIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
>> > >> IN_MEMORY
>> > >> > > =>
>> > >> > > > > > 'false', BLOCKCACHE => 'true'}]}
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > 1 row(s) in 0.0370 seconds
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > table B:
>> > >> > > > > >
>> > >> > > > > > hbase(main):003:0> describe 'ADMd5'
>> > >> > > > > > DESCRIPTION
>> > >> > > > > >
>> > >> > > > > >   ENABLED
>> > >> > > > > >
>> > >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER
>> =>
>> > >> 'ROW',
>> > >> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>
>> > >> 'SNAPPY',
>> > >> > > > > MIN_VE
>> > >> > > > > > true
>> > >> > > > > >
>> > >> > > > > >  RSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
>> > >> > IN_MEMORY
>> > >> > > =>
>> > >> > > > > > 'false', BLOCKCACHE => 'true'}]}
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > 1 row(s) in 0.0280 seconds
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > The containing foldersize in hdfs:
>> > >> > > > > > table A:
>> > >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
>> > >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
>> > >> > > > > > 227.4g  /a_d
>> > >> > > > > >
>> > >> > > > > > table B:
>> > >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
>> > >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
>> > >> > > > > > 501.0g  /a_d
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>
>

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
+1 too for a tool to produce a hash of a table. Like, one hash per region,
or as Lars said, one hash per range. You define the number of buckets you
want, run the MR job, which produces a list of hashes, and compare the
lists from the 2 clusters. Might be pretty simple to do. The more buckets
you define, the lower the risk of a hash collision. We can even have a
global hash and one hash per bucket, and other options...

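To make that concrete, here is a minimal sketch of such a mapper (not an
existing HBase tool; the class name, bucket count, and hashing scheme are
all made up for illustration, and it would be wired into a job with
TableMapReduceUtil.initTableMapperJob):

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;

public class BucketHashMapper extends TableMapper<IntWritable, BytesWritable> {
  // More buckets means lower collision risk and finer-grained diffs.
  private static final int NUM_BUCKETS = 32;

  @Override
  protected void map(ImmutableBytesWritable row, Result result, Context ctx)
      throws IOException, InterruptedException {
    final MessageDigest md5;
    try {
      md5 = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IOException(e);
    }
    // Feed the raw bytes of every cell of this row (key, timestamp and
    // value) into the digest.
    for (KeyValue kv : result.raw()) {
      md5.update(kv.getBuffer(), kv.getOffset(), kv.getLength());
    }
    // Route the row digest to a bucket chosen by its row key. A reducer
    // would XOR all digests of a bucket together; XOR is order-independent,
    // so both clusters get the same bucket hash whatever the scan order.
    int bucket = Math.abs(Arrays.hashCode(row.get()) % NUM_BUCKETS);
    ctx.write(new IntWritable(bucket), new BytesWritable(md5.digest()));
  }
}

Run the job against the same table on each cluster and diff the resulting
bucket hashes instead of shipping the rows themselves.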

2014-08-10 1:59 GMT-04:00 anil gupta <an...@gmail.com>:

> +1 for MerkleTree or Range Hash based implementation. We had a table with 1
> Billion records. We ran verifyRep for that table across two Data Centers
> and it took close to 1 week to finish. It seems that, at present, VerifyRep
> compares every row byte by byte.
>
>
> On Sat, Aug 9, 2014 at 6:11 PM, lars hofhansl <la...@apache.org> wrote:
>
> > VerifyReplication is something you could use. It's not replication
> > specific, just named that way because it was initially conceived as a
> tool
> > to verify that replication is working correctly. Unfortunately it will
> need
> > to ship all data from the remote cluster, which is quite inefficient.
> > I think we should include a better way with HBase, maybe using
> > Merkle trees, or at least hashes of ranges, and compare those.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Colin Kincaid Williams <di...@uw.edu>
> > To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > Sent: Saturday, August 9, 2014 2:28 PM
> > Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable
> > operation.
> >
> >
> >
> > Hi Everybody,
> >
> > I do wish to upgrade to a more recent hbase soon. However, the choice
> isn't
> > entirely mine. Does anybody know how to verify the contents between
> tables
> > across clusters after a copytable operation?
> > I see replication.VerifyReplication, but that seems replication
> specific.
> > Maybe I should have begun with replication in the first place...
> >
> >
> >
> >
> >
> >
> > On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > Hi Colin,
> > >
> > >you might want to consider upgrading. The current stable version is
> > 0.98.4 (soon .5).
> > >
> > >Even just going to 0.94 will give a lot of new features, stability, and
> > performance.
> > >0.92.x can be upgraded to 0.94.x without any downtime and without any
> > upgrade steps necessary.
> > >For an upgrade to 0.98 and later you'd need some downtime and also
> execute
> > an upgrade step.
> > >
> > >
> > >-- Lars
> > >
> > >
> > >
> > >
> > >----- Original Message -----
> > >From: Colin Kincaid Williams <di...@uw.edu>
> > >To: user@hbase.apache.org
> > >Cc:
> > >Sent: Friday, August 8, 2014 1:16 PM
> > >Subject: Re: Large discrepancy in hdfs hbase rootdir size after
> copytable
> > operation.
> > >
> > >Not in the hbase shell I have:
> > >
> > >hbase version
> > >14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
> > >14/08/08 14:16:08 INFO util.VersionInfo: Subversion
> >
> >
> >file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3
> > >-r Unknown
> > >14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan
> 26
> > >17:11:38 PST 2013
> > >
> > >
> > >
> > >
> > >
> > >
> > >On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > >> Using simplified version of your command, I saw the following in shell
> > >> output (you may have noticed as well):
> > >>
> > >> An argument ignored (unknown or overridden): BLOOMFILTER
> > >> An argument ignored (unknown or overridden): VERSIONS
> > >> 0 row(s) in 2.1110 seconds
> > >>
> > >> Cheers
> > >>
> > >>
> > >> On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <
> discord@uw.edu
> > >
> > >> wrote:
> > >>
> > >> > I have discovered the error. I made the mistake regarding the
> > compression
> > >> > and the bloom filter. The new table doesn't have them enabled, and
> the
> > >> old
> > >> > does. However I'm wondering how I can create tables with splits and
> bf
> > >> and
> > >> > compression enabled. Shouldn't the following command return an
> error?
> > >> >
> > >> > hbase(main):001:0> create 'ADMd5','a',{
> > >> >
> > >> > hbase(main):002:1* BLOOMFILTER => 'ROW',
> > >> > hbase(main):003:1* VERSIONS => '1',
> > >> > hbase(main):004:1* COMPRESSION => 'SNAPPY',
> > >> > hbase(main):005:1* MIN_VERSIONS => '0',
> > >> > hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > >> > hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
> > >> > hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
> > >> > hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
> > >> > hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
> > >> > hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > >> > hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
> > >> > hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
> > >> > hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
> > >> > hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
> > >> > hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > >> > hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
> > >> > hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
> > >> > hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
> > >> > hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
> > >> > hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
> > >> > hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
> > >> > hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
> > >> > 0 row(s) in 1.8010 seconds
> > >> >
> > >> > hbase(main):024:0> describe 'ADMd5'
> > >> > DESCRIPTION                                        ENABLED
> > >> >
> > >> >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOO true
> > >> >
> > >> >  MFILTER => 'NONE', REPLICATION_SCOPE => '0', VERS
> > >> >
> > >> >  IONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS
> > >> >
> > >> >  => '0', TTL => '2147483647', BLOCKSIZE => '65536'
> > >> >
> > >> >  , IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
> > >> >
> > >> > 1 row(s) in 0.0420 seconds
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <
> > >> > jean-marc@spaggiari.org
> > >> > > wrote:
> > >> >
> > >> > > Hi Colin,
> > >> > >
> > >> > > Just to make sure.
> > >> > >
> > >> > > Is table A from the source cluster and not compressed, and table B
> > in
> > >> the
> > >> > > destination cluster and SNAPPY compressed? Is that correct? Then
> > ratio
> > >> > > should be the opposite. Are you able to du -h from hadoop to see
> if
> > all
> > >> > > regions are evenly bigger or if anything else is wrong?
> > >> > >
> > >> > >
> > >> > > 2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <discord@uw.edu
> >:
> > >> > >
> > >> > > > I haven't yet tried to major compact table B. I will look up
> some
> > >> > > > documentation on WALs and snapshots to find this information in
> > the
> > >> > hdfs
> > >> > > > filesystem tomorrow. Could it be caused by the bloomfilter
> > existing
> > >> on
> > >> > > > table B, but not table A? The funny thing is the source table is
> > >> > smaller
> > >> > > > than the destination.
> > >> > > >
> > >> > > >
> > >> > > > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <
> > >> > esteban@cloudera.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi Colin,
> > >> > > > >
> > >> > > > > Have you verified if the content of /a_d includes WALs and/or
> > the
> > >> > > content
> > >> > > > > of the snapshots or the HBase archive? have you tried to major
> > >> > compact
> > >> > > > > table B?  does it make any difference?
> > >> > > > >
> > >> > > > > regards,
> > >> > > > > esteban.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > > Cloudera, Inc.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams <
> > >> > discord@uw.edu
> > >> > > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > I used the copy table command to copy a database between the
> > >> > original
> > >> > > > > > cluster A and a new cluster B. I have noticed that the
> > rootdir is
> > >> > > > larger
> > >> > > > > > than 2X the size of the original. I am trying to account for
> > >> such a
> > >> > > > large
> > >> > > > > > difference. The following are some details about the table.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > I'm trying to figure out why my copied table is more than 2X
> > the
> > >> > size
> > >> > > > of
> > >> > > > > > the original table. Could the bloomfilter itself account for
> > >> this?
> > >> > > > > >
> > >> > > > > > The guide I used as a reference:
> > >> > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Supposedly the original command used to create the table on
> > >> cluster
> > >> > > A:
> > >> > > > > >
> > >> > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW', VERSIONS
> > =>
> > >> > '1',
> > >> > > > > > COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > How I created the target table on cluster B:
> > >> > > > > >
> > >> > > > > > create 'ADMd5','a',{
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > BLOOMFILTER => 'ROW',
> > >> > > > > > VERSIONS => '1',
> > >> > > > > > COMPRESSION => 'SNAPPY',
> > >> > > > > > MIN_VERSIONS => '0',
> > >> > > > > > SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > >> > > > > > '/zyuFR1VmhJyF4rbWsFnEg==',
> > >> > > > > > '0sZYnBd83ul58d1O8I2JnA==',
> > >> > > > > > '2+03N7IicZH3ltrqZUX6kQ==',
> > >> > > > > > '4+/slRQtkBDU7Px6C9MAbg==',
> > >> > > > > > '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > >> > > > > > '7+2pvtpHUQHWkZJoouR9wQ==',
> > >> > > > > > '8+4n2deXhzmrpe//2Fo6Fg==',
> > >> > > > > > '9+4SKW/BmNzpL68cXwKV1Q==',
> > >> > > > > > 'A+4ajStFkjEMf36cX5D9xg==',
> > >> > > > > > 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > >> > > > > > 'C+6lKKDiOWl5qrRn72fNCw==',
> > >> > > > > > 'D+6dZMyn7m+NhJ7G07gqaw==',
> > >> > > > > > 'E+6BrimmrpAd92gZJ5hyMw==',
> > >> > > > > > 'G+5tisu4xWZMOJnDHeYBJg==',
> > >> > > > > > 'I+7fRy4dvqcM/L6dFRQk9g==',
> > >> > > > > > 'J+8ECMw1zeOyjfOg/ypXJA==',
> > >> > > > > > 'K+7tenLYn6a1aNLniL6tbg==']}
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > How the tables now appear in hbase shell:
> > >> > > > > >
> > >> > > > > > table A:
> > >> > > > > >
> > >> > > > > > describe 'ADMd5'
> > >> > > > > > DESCRIPTION
> > >> > > > > >
> > >> > > > > >   ENABLED
> > >> > > > > >
> > >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
> > >> > 'NONE',
> > >> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION =>
> > 'NONE',
> > >> > > > MIN_VER
> > >> > > > > > true
> > >> > > > > >
> > >> > > > > >  SIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> > >> IN_MEMORY
> > >> > > =>
> > >> > > > > > 'false', BLOCKCACHE => 'true'}]}
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > 1 row(s) in 0.0370 seconds
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > table B:
> > >> > > > > >
> > >> > > > > > hbase(main):003:0> describe 'ADMd5'
> > >> > > > > > DESCRIPTION
> > >> > > > > >
> > >> > > > > >   ENABLED
> > >> > > > > >
> > >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
> > >> 'ROW',
> > >> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>
> > >> 'SNAPPY',
> > >> > > > > MIN_VE
> > >> > > > > > true
> > >> > > > > >
> > >> > > > > >  RSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> > >> > IN_MEMORY
> > >> > > =>
> > >> > > > > > 'false', BLOCKCACHE => 'true'}]}
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > 1 row(s) in 0.0280 seconds
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > The containing foldersize in hdfs:
> > >> > > > > > table A:
> > >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > >> > > > > > 227.4g  /a_d
> > >> > > > > >
> > >> > > > > > table B:
> > >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > >> > > > > > 501.0g  /a_d
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
>
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by anil gupta <an...@gmail.com>.
+1 for MerkleTree or Range Hash based implementation. We had a table with 1
Billion records. We ran verifyRep for that table across two Data Centers
and it took close to 1 week to finish. It seems that, at present, VerifyRep
compares every row byte by byte.

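A run like that is typically kicked off as follows (the options here are a
sketch from memory, so check the usage output of your version; "1" stands
in for whatever replication peer id is configured):

hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication \
    --starttime=0 --stoptime=1407715200000 1 ADMd5

The job's GOODROWS and BADROWS counters report whether the two copies
agree, but every row still gets shipped and compared, hence the long
runtime.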

On Sat, Aug 9, 2014 at 6:11 PM, lars hofhansl <la...@apache.org> wrote:

> VerifyReplication is something you could use. It's not replication
> specific, just named that way because it was initially conceived as a tool
> to verify that replication is working correctly. Unfortunately it will need
> to ship all data from the remote cluster, which is quite inefficient.
> I think we should include a better way with HBase, maybe using
> Merkle trees, or at least hashes of ranges, and compare those.
>
> -- Lars
>
>
>
> ________________________________
>  From: Colin Kincaid Williams <di...@uw.edu>
> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> Sent: Saturday, August 9, 2014 2:28 PM
> Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable
> operation.
>
>
>
> Hi Everybody,
>
> I do wish to upgrade to a more recent hbase soon. However, the choice isn't
> entirely mine. Does anybody know how to verify the contents between tables
> across clusters after a copytable operation?
> I see replication.VerifyReplication, but that seems replication specific.
> Maybe I should have begun with replication in the first place...
>
>
>
>
>
>
> On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <la...@apache.org> wrote:
>
> Hi Colin,
> >
> >you might want to consider upgrading. The current stable version is
> 0.98.4 (soon .5).
> >
> >Even just going to 0.94 will give a lot of new features, stability, and
> performance.
> >0.92.x can be upgraded to 0.94.x without any downtime and without any
> upgrade steps necessary.
> >For an upgrade to 0.98 and later you'd need some downtime and also execute
> an upgrade step.
> >
> >
> >-- Lars
> >
> >
> >
> >
> >----- Original Message -----
> >From: Colin Kincaid Williams <di...@uw.edu>
> >To: user@hbase.apache.org
> >Cc:
> >Sent: Friday, August 8, 2014 1:16 PM
> >Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable
> operation.
> >
> >Not in the hbase shell I have:
> >
> >hbase version
> >14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
> >14/08/08 14:16:08 INFO util.VersionInfo: Subversion
>
> >file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3
> >-r Unknown
> >14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan 26
> >17:11:38 PST 2013
> >
> >
> >
> >
> >
> >
> >On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> Using simplified version of your command, I saw the following in shell
> >> output (you may have noticed as well):
> >>
> >> An argument ignored (unknown or overridden): BLOOMFILTER
> >> An argument ignored (unknown or overridden): VERSIONS
> >> 0 row(s) in 2.1110 seconds
> >>
> >> Cheers
> >>
> >>
> >> On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <discord@uw.edu
> >
> >> wrote:
> >>
> >> > I have discovered the error. I made the mistake regarding the
> compression
> >> > and the bloom filter. The new table doesn't have them enabled, and the
> >> old
> >> > does. However I'm wondering how I can create tables with splits and bf
> >> and
> >> > compression enabled. Shouldn't the following command return an error?
> >> >
> >> > hbase(main):001:0> create 'ADMd5','a',{
> >> >
> >> > hbase(main):002:1* BLOOMFILTER => 'ROW',
> >> > hbase(main):003:1* VERSIONS => '1',
> >> > hbase(main):004:1* COMPRESSION => 'SNAPPY',
> >> > hbase(main):005:1* MIN_VERSIONS => '0',
> >> > hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> >> > hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
> >> > hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
> >> > hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
> >> > hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
> >> > hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
> >> > hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
> >> > hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
> >> > hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
> >> > hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
> >> > hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
> >> > hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
> >> > hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
> >> > hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
> >> > hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
> >> > hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
> >> > hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
> >> > hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
> >> > 0 row(s) in 1.8010 seconds
> >> >
> >> > hbase(main):024:0> describe 'ADMd5'
> >> > DESCRIPTION                                        ENABLED
> >> >
> >> >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOO true
> >> >
> >> >  MFILTER => 'NONE', REPLICATION_SCOPE => '0', VERS
> >> >
> >> >  IONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS
> >> >
> >> >  => '0', TTL => '2147483647', BLOCKSIZE => '65536'
> >> >
> >> >  , IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
> >> >
> >> > 1 row(s) in 0.0420 seconds
> >> >
> >> >
> >> >
> >> > On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <
> >> > jean-marc@spaggiari.org
> >> > > wrote:
> >> >
> >> > > Hi Colin,
> >> > >
> >> > > Just to make sure.
> >> > >
> >> > > Is table A from the source cluster and not compressed, and table B
> in
> >> the
> >> > > destination cluster and SNAPPY compressed? Is that correct? Then
> ratio
> >> > > should be the opposite. Are you able to du -h from hadoop to see if
> all
> >> > > regions are evenly bigger or if anything else is wrong?
> >> > >
> >> > >
> >> > > 2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <di...@uw.edu>:
> >> > >
> >> > > > I haven't yet tried to major compact table B. I will look up some
> >> > > > documentation on WALs and snapshots to find this information in
> the
> >> > hdfs
> >> > > > filesystem tomorrow. Could it be caused by the bloomfilter
> existing
> >> on
> >> > > > table B, but not table A? The funny thing is the source table is
> >> > smaller
> >> > > > than the destination.
> >> > > >
> >> > > >
> >> > > > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <
> >> > esteban@cloudera.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi Colin,
> >> > > > >
> >> > > > > Have you verified if the content of /a_d includes WALs and/or
> the
> >> > > content
> >> > > > > of the snapshots or the HBase archive? have you tried to major
> >> > compact
> >> > > > > table B?  does it make any difference?
> >> > > > >
> >> > > > > regards,
> >> > > > > esteban.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Cloudera, Inc.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams <
> >> > discord@uw.edu
> >> > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > I used the copy table command to copy a database between the
> >> > original
> >> > > > > > cluster A and a new cluster B. I have noticed that the
> rootdir is
> >> > > > larger
> >> > > > > > than 2X the size of the original. I am trying to account for
> >> such a
> >> > > > large
> >> > > > > > difference. The following are some details about the table.
> >> > > > > >
> >> > > > > >
> >> > > > > > I'm trying to figure out why my copied table is more than 2X
> the
> >> > size
> >> > > > of
> >> > > > > > the original table. Could the bloomfilter itself account for
> >> this?
> >> > > > > >
> >> > > > > > The guide I used as a reference:
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > Supposedly the original command used to create the table on
> >> cluster
> >> > > A:
> >> > > > > >
> >> > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW', VERSIONS
> =>
> >> > '1',
> >> > > > > > COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}
> >> > > > > >
> >> > > > > >
> >> > > > > > How I created the target table on cluster B:
> >> > > > > >
> >> > > > > > create 'ADMd5','a',{
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > BLOOMFILTER => 'ROW',
> >> > > > > > VERSIONS => '1',
> >> > > > > > COMPRESSION => 'SNAPPY',
> >> > > > > > MIN_VERSIONS => '0',
> >> > > > > > SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> >> > > > > > '/zyuFR1VmhJyF4rbWsFnEg==',
> >> > > > > > '0sZYnBd83ul58d1O8I2JnA==',
> >> > > > > > '2+03N7IicZH3ltrqZUX6kQ==',
> >> > > > > > '4+/slRQtkBDU7Px6C9MAbg==',
> >> > > > > > '6+1dGCQ/IBrCsrNQXe/9xQ==',
> >> > > > > > '7+2pvtpHUQHWkZJoouR9wQ==',
> >> > > > > > '8+4n2deXhzmrpe//2Fo6Fg==',
> >> > > > > > '9+4SKW/BmNzpL68cXwKV1Q==',
> >> > > > > > 'A+4ajStFkjEMf36cX5D9xg==',
> >> > > > > > 'B+6Zm6Kccb3l6iM2L0epxQ==',
> >> > > > > > 'C+6lKKDiOWl5qrRn72fNCw==',
> >> > > > > > 'D+6dZMyn7m+NhJ7G07gqaw==',
> >> > > > > > 'E+6BrimmrpAd92gZJ5hyMw==',
> >> > > > > > 'G+5tisu4xWZMOJnDHeYBJg==',
> >> > > > > > 'I+7fRy4dvqcM/L6dFRQk9g==',
> >> > > > > > 'J+8ECMw1zeOyjfOg/ypXJA==',
> >> > > > > > 'K+7tenLYn6a1aNLniL6tbg==']}
> >> > > > > >
> >> > > > > >
> >> > > > > > How the tables now appear in hbase shell:
> >> > > > > >
> >> > > > > > table A:
> >> > > > > >
> >> > > > > > describe 'ADMd5'
> >> > > > > > DESCRIPTION
> >> > > > > >
> >> > > > > >   ENABLED
> >> > > > > >
> >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
> >> > 'NONE',
> >> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION =>
> 'NONE',
> >> > > > MIN_VER
> >> > > > > > true
> >> > > > > >
> >> > > > > >  SIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> >> IN_MEMORY
> >> > > =>
> >> > > > > > 'false', BLOCKCACHE => 'true'}]}
> >> > > > > >
> >> > > > > >
> >> > > > > > 1 row(s) in 0.0370 seconds
> >> > > > > >
> >> > > > > >
> >> > > > > > table B:
> >> > > > > >
> >> > > > > > hbase(main):003:0> describe 'ADMd5'
> >> > > > > > DESCRIPTION
> >> > > > > >
> >> > > > > >   ENABLED
> >> > > > > >
> >> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
> >> 'ROW',
> >> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>
> >> 'SNAPPY',
> >> > > > > MIN_VE
> >> > > > > > true
> >> > > > > >
> >> > > > > >  RSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> >> > IN_MEMORY
> >> > > =>
> >> > > > > > 'false', BLOCKCACHE => 'true'}]}
> >> > > > > >
> >> > > > > >
> >> > > > > > 1 row(s) in 0.0280 seconds
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > The containing foldersize in hdfs:
> >> > > > > > table A:
> >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> >> > > > > > 227.4g  /a_d
> >> > > > > >
> >> > > > > > table B:
> >> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> >> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> >> > > > > > 501.0g  /a_d
> >> > > > > >
> >> > > > > >
> >> > > > > > https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >




-- 
Thanks & Regards,
Anil Gupta

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by lars hofhansl <la...@apache.org>.
VerifyReplication is something you could use. It's not replication specific, just named that way because it was initially conceived as a tool to verify that replication is working correctly. Unfortunately it will need to ship all data from the remote cluster, which is quite inefficient.
I think we should include a better way with HBase, maybe using Merkle trees, or at least hashes of ranges, and compare those.
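
A reducer to pair with a per-row-digest mapper could fold each range's
hashes like this (a sketch only; BucketHashReducer is a made-up name, not
an HBase class):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class BucketHashReducer
    extends Reducer<IntWritable, BytesWritable, IntWritable, BytesWritable> {

  @Override
  protected void reduce(IntWritable bucket, Iterable<BytesWritable> digests,
      Context ctx) throws IOException, InterruptedException {
    // XOR-fold all row digests of the bucket. XOR is order-independent,
    // so both clusters produce identical bucket hashes regardless of the
    // order in which rows were scanned.
    byte[] folded = new byte[16]; // MD5 digests are 16 bytes
    for (BytesWritable d : digests) {
      byte[] b = d.getBytes();
      for (int i = 0; i < folded.length; i++) {
        folded[i] ^= b[i];
      }
    }
    ctx.write(bucket, new BytesWritable(folded));
  }
}

Comparing the few resulting bucket hashes then replaces shipping every row
across the WAN.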

-- Lars



________________________________
 From: Colin Kincaid Williams <di...@uw.edu>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org> 
Sent: Saturday, August 9, 2014 2:28 PM
Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.
 


Hi Everybody,

I do wish to upgrade to a more recent hbase soon. However, the choice isn't entirely mine. Does anybody know how to verify the contents between tables across clusters after a copytable operation?
I see replication.VerifyReplication, but that seems replication specific. Maybe I should have begun with replication in the first place...






On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <la...@apache.org> wrote:

Hi Colin,
>
>you might want to consider upgrading. The current stable version is 0.98.4 (soon .5).
>
>Even just going to 0.94 will give a lot of new features, stability, and performance.
>0.92.x can be upgraded to 0.94.x without any downtime and without any upgrade steps necessary.
>For an upgrade to 0.98 and later you'd need some downtime and also execute an upgrade step.
>
>
>-- Lars
>
>
>
>
>----- Original Message -----
>From: Colin Kincaid Williams <di...@uw.edu>
>To: user@hbase.apache.org
>Cc:
>Sent: Friday, August 8, 2014 1:16 PM
>Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.
>
>Not in the hbase shell I have:
>
>hbase version
>14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
>14/08/08 14:16:08 INFO util.VersionInfo: Subversion
>file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3
>-r Unknown
>14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan 26
>17:11:38 PST 2013
>
>
>
>
>
>
>On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> Using simplified version of your command, I saw the following in shell
>> output (you may have noticed as well):
>>
>> An argument ignored (unknown or overridden): BLOOMFILTER
>> An argument ignored (unknown or overridden): VERSIONS
>> 0 row(s) in 2.1110 seconds
>>
>> Cheers
>>
>>
>> On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <di...@uw.edu>
>> wrote:
>>
>> > I have discovered the error. I made the mistake regarding the compression
>> > and the bloom filter. The new table doesn't have them enabled, and the
>> old
>> > does. However I'm wondering how I can create tables with splits and bf
>> and
>> > compression enabled. Shouldn't the following command return an error?
>> >
>> > hbase(main):001:0> create 'ADMd5','a',{
>> >
>> > hbase(main):002:1* BLOOMFILTER => 'ROW',
>> > hbase(main):003:1* VERSIONS => '1',
>> > hbase(main):004:1* COMPRESSION => 'SNAPPY',
>> > hbase(main):005:1* MIN_VERSIONS => '0',
>> > hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
>> > hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
>> > hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
>> > hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
>> > hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
>> > hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
>> > hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
>> > hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
>> > hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
>> > hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
>> > hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
>> > hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
>> > hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
>> > hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
>> > hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
>> > hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
>> > hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
>> > hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
>> > 0 row(s) in 1.8010 seconds
>> >
>> > hbase(main):024:0> describe 'ADMd5'
>> > DESCRIPTION                                        ENABLED
>> >
>> >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOO true
>> >
>> >  MFILTER => 'NONE', REPLICATION_SCOPE => '0', VERS
>> >
>> >  IONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS
>> >
>> >  => '0', TTL => '2147483647', BLOCKSIZE => '65536'
>> >
>> >  , IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>> >
>> > 1 row(s) in 0.0420 seconds
>> >
>> >
>> >
>> > On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <
>> > jean-marc@spaggiari.org
>> > > wrote:
>> >
>> > > Hi Colin,
>> > >
>> > > Just to make sure.
>> > >
>> > > Is table A from the source cluster and not compressed, and table B in
>> the
>> > > destination cluster and SNAPPY compressed? Is that correct? Then ratio
>> > > should be the opposite. Are you able to du -h from hadoop to see if all
>> > > regions are evenly bigger or if anything else is wrong?
>> > >
>> > >
>> > > 2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <di...@uw.edu>:
>> > >
>> > > > I haven't yet tried to major compact table B. I will look up some
>> > > > documentation on WALs and snapshots to find this information in the
>> > hdfs
>> > > > filesystem tomorrow. Could it be caused by the bloomfilter existing
>> on
>> > > > table B, but not table A? The funny thing is the source table is
>> > smaller
>> > > > than the destination.
>> > > >
>> > > >
>> > > > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <
>> > esteban@cloudera.com>
>> > > > wrote:
>> > > >
>> > > > > Hi Colin,
>> > > > >
>> > > > > Have you verified if the content of /a_d includes WALs and/or the
>> > > content
>> > > > > of the snapshots or the HBase archive? have you tried to major
>> > compact
>> > > > > table B?  does it make any difference?
>> > > > >
>> > > > > regards,
>> > > > > esteban.
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Cloudera, Inc.
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams <
>> > discord@uw.edu
>> > > >
>> > > > > wrote:
>> > > > >
>> > > > > > I used the copy table command to copy a database between the
>> > original
>> > > > > > cluster A and a new cluster B. I have noticed that the rootdir is
>> > > > larger
>> > > > > > than 2X the size of the original. I am trying to account for
>> such a
>> > > > large
>> > > > > > difference. The following are some details about the table.
>> > > > > >
>> > > > > >
>> > > > > > I'm trying to figure out why my copied table is more than 2X the
>> > size
>> > > > of
>> > > > > > the original table. Could the bloomfilter itself account for
>> this?
>> > > > > >
>> > > > > > The guide I used as a reference:
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > Supposedly the original command used to create the table on
>> cluster
>> > > A:
>> > > > > >
>> > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW', VERSIONS =>
>> > '1',
>> > > > > > COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}
>> > > > > >
>> > > > > >
>> > > > > > How I created the target table on cluster B:
>> > > > > >
>> > > > > > create 'ADMd5','a',{
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > BLOOMFILTER => 'ROW',
>> > > > > > VERSIONS => '1',
>> > > > > > COMPRESSION => 'SNAPPY',
>> > > > > > MIN_VERSIONS => '0',
>> > > > > > SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
>> > > > > > '/zyuFR1VmhJyF4rbWsFnEg==',
>> > > > > > '0sZYnBd83ul58d1O8I2JnA==',
>> > > > > > '2+03N7IicZH3ltrqZUX6kQ==',
>> > > > > > '4+/slRQtkBDU7Px6C9MAbg==',
>> > > > > > '6+1dGCQ/IBrCsrNQXe/9xQ==',
>> > > > > > '7+2pvtpHUQHWkZJoouR9wQ==',
>> > > > > > '8+4n2deXhzmrpe//2Fo6Fg==',
>> > > > > > '9+4SKW/BmNzpL68cXwKV1Q==',
>> > > > > > 'A+4ajStFkjEMf36cX5D9xg==',
>> > > > > > 'B+6Zm6Kccb3l6iM2L0epxQ==',
>> > > > > > 'C+6lKKDiOWl5qrRn72fNCw==',
>> > > > > > 'D+6dZMyn7m+NhJ7G07gqaw==',
>> > > > > > 'E+6BrimmrpAd92gZJ5hyMw==',
>> > > > > > 'G+5tisu4xWZMOJnDHeYBJg==',
>> > > > > > 'I+7fRy4dvqcM/L6dFRQk9g==',
>> > > > > > 'J+8ECMw1zeOyjfOg/ypXJA==',
>> > > > > > 'K+7tenLYn6a1aNLniL6tbg==']}
>> > > > > >
>> > > > > >
>> > > > > > How the tables now appear in hbase shell:
>> > > > > >
>> > > > > > table A:
>> > > > > >
>> > > > > > describe 'ADMd5'
>> > > > > > DESCRIPTION
>> > > > > >
>> > > > > >   ENABLED
>> > > > > >
>> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
>> > 'NONE',
>> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
>> > > > MIN_VER
>> > > > > > true
>> > > > > >
>> > > > > >  SIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
>> IN_MEMORY
>> > > =>
>> > > > > > 'false', BLOCKCACHE => 'true'}]}
>> > > > > >
>> > > > > >
>> > > > > > 1 row(s) in 0.0370 seconds
>> > > > > >
>> > > > > >
>> > > > > > table B:
>> > > > > >
>> > > > > > hbase(main):003:0> describe 'ADMd5'
>> > > > > > DESCRIPTION
>> > > > > >
>> > > > > >   ENABLED
>> > > > > >
>> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
>> 'ROW',
>> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>
>> 'SNAPPY',
>> > > > > MIN_VE
>> > > > > > true
>> > > > > >
>> > > > > >  RSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
>> > IN_MEMORY
>> > > =>
>> > > > > > 'false', BLOCKCACHE => 'true'}]}
>> > > > > >
>> > > > > >
>> > > > > > 1 row(s) in 0.0280 seconds
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > The containing foldersize in hdfs:
>> > > > > > table A:
>> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
>> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
>> > > > > > 227.4g  /a_d
>> > > > > >
>> > > > > > table B:
>> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
>> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
>> > > > > > 501.0g  /a_d
>> > > > > >
>> > > > > >
>> > > > > > https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Colin Kincaid Williams <di...@uw.edu>.
Hi Everybody,

I do wish to upgrade to a more recent hbase soon. However, the choice isn't
entirely mine. Does anybody know how to verify the contents between tables
across clusters after a copytable operation?
I see replication.VerifyReplication, but that seems replication specific.
Maybe I should have begun with replication in the first place...

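One cheap first check that works on 0.92 without any replication setup is
to compare plain row counts with the stock RowCounter job on each cluster
(it proves nothing about cell contents, only that the counts match):

hbase org.apache.hadoop.hbase.mapreduce.RowCounter ADMd5

Then compare the ROWS counter in the two job outputs.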

On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <la...@apache.org> wrote:

> Hi Colin,
>
> you might want to consider upgrading. The current stable version is 0.98.4
> (soon .5).
>
> Even just going to 0.94 will give a lot of new features, stability, and
> performance.
> 0.92.x can be upgraded to 0.94.x without any downtime and without any
> upgrade steps necessary.
> For an upgrade to 0.98 and later you'd need some downtime and also execute
> an upgrade step.
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Colin Kincaid Williams <di...@uw.edu>
> To: user@hbase.apache.org
> Cc:
> Sent: Friday, August 8, 2014 1:16 PM
> Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable
> operation.
>
> Not in the hbase shell I have:
>
> hbase version
> 14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
> 14/08/08 14:16:08 INFO util.VersionInfo: Subversion
>
> file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3
> -r Unknown
> 14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan 26
> 17:11:38 PST 2013
>
>
>
>
>
>
> On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Using simplified version of your command, I saw the following in shell
> > output (you may have noticed as well):
> >
> > An argument ignored (unknown or overridden): BLOOMFILTER
> > An argument ignored (unknown or overridden): VERSIONS
> > 0 row(s) in 2.1110 seconds
> >
> > Cheers
> >
> >
> > On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <di...@uw.edu>
> > wrote:
> >
> > > I have discovered the error. I made the mistake regarding the
> compression
> > > and the bloom filter. The new table doesn't have them enabled, and the
> > old
> > > does. However I'm wondering how I can create tables with splits and bf
> > and
> > > compression enabled. Shouldn't the following command return an error?
> > >
> > > hbase(main):001:0> create 'ADMd5','a',{
> > >
> > > hbase(main):002:1* BLOOMFILTER => 'ROW',
> > > hbase(main):003:1* VERSIONS => '1',
> > > hbase(main):004:1* COMPRESSION => 'SNAPPY',
> > > hbase(main):005:1* MIN_VERSIONS => '0',
> > > hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > > hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
> > > hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
> > > hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
> > > hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
> > > hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > > hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
> > > hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
> > > hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
> > > hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
> > > hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > > hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
> > > hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
> > > hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
> > > hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
> > > hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
> > > hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
> > > hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
> > > 0 row(s) in 1.8010 seconds
> > >
> > > hbase(main):024:0> describe 'ADMd5'
> > > DESCRIPTION                                        ENABLED
> > >
> > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOO true
> > >
> > >  MFILTER => 'NONE', REPLICATION_SCOPE => '0', VERS
> > >
> > >  IONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS
> > >
> > >  => '0', TTL => '2147483647', BLOCKSIZE => '65536'
> > >
> > >  , IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
> > >
> > > 1 row(s) in 0.0420 seconds
> > >
> > >
> > >
> > > On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org
> > > > wrote:
> > >
> > > > Hi Colin,
> > > >
> > > > Just to make sure.
> > > >
> > > > Is table A from the source cluster and not compressed, and table B in
> > the
> > > > destination cluster and SNAPPY compressed? Is that correct? Then
> ratio
> > > > should be the opposite. Are you able to du -h from hadoop to see if
> all
> > > > regions are evenly bigger or if anything else is wrong?
> > > >
> > > >
> > > > 2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <di...@uw.edu>:
> > > >
> > > > > I haven't yet tried to major compact table B. I will look up some
> > > > > documentation on WALs and snapshots to find this information in the
> > > hdfs
> > > > > filesystem tomorrow. Could it be caused by the bloomfilter existing
> > on
> > > > > table B, but not table A? The funny thing is the source table is
> > > smaller
> > > > > than the destination.
> > > > >
> > > > >
> > > > > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <
> > > esteban@cloudera.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Colin,
> > > > > >
> > > > > > Have you verified if the content of /a_d includes WALs and/or the
> > > > content
> > > > > > of the snapshots or the HBase archive? have you tried to major
> > > compact
> > > > > > table B?  does it make any difference?
> > > > > >
> > > > > > regards,
> > > > > > esteban.
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Cloudera, Inc.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams <
> > > discord@uw.edu
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I used the copy table command to copy a database between the
> > > original
> > > > > > > cluster A and a new cluster B. I have noticed that the rootdir
> is
> > > > > larger
> > > > > > > than 2X the size of the original. I am trying to account for
> > such a
> > > > > large
> > > > > > > difference. The following are some details about the table.
> > > > > > >
> > > > > > >
> > > > > > > I'm trying to figure out why my copied table is more than 2X
> the
> > > size
> > > > > of
> > > > > > > the original table. Could the bloomfilter itself account for
> > this?
> > > > > > >
> > > > > > > The guide I used as a reference:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Supposedly the original command used to create the table on
> > cluster
> > > > A:
> > > > > > >
> > > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW', VERSIONS =>
> > > '1',
> > > > > > > COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}
> > > > > > >
> > > > > > >
> > > > > > > How I created the target table on cluster B:
> > > > > > >
> > > > > > > create 'ADMd5','a',{
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > BLOOMFILTER => 'ROW',
> > > > > > > VERSIONS => '1',
> > > > > > > COMPRESSION => 'SNAPPY',
> > > > > > > MIN_VERSIONS => '0',
> > > > > > > SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > > > > > > '/zyuFR1VmhJyF4rbWsFnEg==',
> > > > > > > '0sZYnBd83ul58d1O8I2JnA==',
> > > > > > > '2+03N7IicZH3ltrqZUX6kQ==',
> > > > > > > '4+/slRQtkBDU7Px6C9MAbg==',
> > > > > > > '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > > > > > > '7+2pvtpHUQHWkZJoouR9wQ==',
> > > > > > > '8+4n2deXhzmrpe//2Fo6Fg==',
> > > > > > > '9+4SKW/BmNzpL68cXwKV1Q==',
> > > > > > > 'A+4ajStFkjEMf36cX5D9xg==',
> > > > > > > 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > > > > > > 'C+6lKKDiOWl5qrRn72fNCw==',
> > > > > > > 'D+6dZMyn7m+NhJ7G07gqaw==',
> > > > > > > 'E+6BrimmrpAd92gZJ5hyMw==',
> > > > > > > 'G+5tisu4xWZMOJnDHeYBJg==',
> > > > > > > 'I+7fRy4dvqcM/L6dFRQk9g==',
> > > > > > > 'J+8ECMw1zeOyjfOg/ypXJA==',
> > > > > > > 'K+7tenLYn6a1aNLniL6tbg==']}
> > > > > > >
> > > > > > >
> > > > > > > How the tables now appear in hbase shell:
> > > > > > >
> > > > > > > table A:
> > > > > > >
> > > > > > > describe 'ADMd5'
> > > > > > > DESCRIPTION
> > > > > > >
> > > > > > >   ENABLED
> > > > > > >
> > > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
> > > 'NONE',
> > > > > > > REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION =>
> 'NONE',
> > > > > MIN_VER
> > > > > > > true
> > > > > > >
> > > > > > >  SIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> > IN_MEMORY
> > > > =>
> > > > > > > 'false', BLOCKCACHE => 'true'}]}
> > > > > > >
> > > > > > >
> > > > > > > 1 row(s) in 0.0370 seconds
> > > > > > >
> > > > > > >
> > > > > > > table B:
> > > > > > >
> > > > > > > hbase(main):003:0> describe 'ADMd5'
> > > > > > > DESCRIPTION
> > > > > > >
> > > > > > >   ENABLED
> > > > > > >
> > > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
> > 'ROW',
> > > > > > > REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>
> > 'SNAPPY',
> > > > > > MIN_VE
> > > > > > > true
> > > > > > >
> > > > > > >  RSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> > > IN_MEMORY
> > > > =>
> > > > > > > 'false', BLOCKCACHE => 'true'}]}
> > > > > > >
> > > > > > >
> > > > > > > 1 row(s) in 0.0280 seconds
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The containing foldersize in hdfs:
> > > > > > > table A:
> > > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > > > > > 227.4g  /a_d
> > > > > > >
> > > > > > > table B:
> > > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > > > > > 501.0g  /a_d
> > > > > > >
> > > > > > >
> > > > > > > https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>

Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by lars hofhansl <la...@apache.org>.
Hi Colin,

you might want to consider upgrading. The current stable version is 0.98.4 (soon .5).

Even just going to 0.94 will give a lot of new features, stability, and performance.
0.92.x can be upgraded to 0.94.x without any downtime and without any upgrade steps necessary.
For an upgrade to 0.98 and later you'd need some downtime and also execute an upgrade step.

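(The upgrade step for the 0.96+ line is run with the bundled tool, roughly
as below; this is from memory, so check the reference guide for the exact
procedure on your target release.)

bin/hbase upgrade -check    # dry run: flags leftover HFile v1 files
bin/hbase upgrade -execute  # runs the migration, with the cluster down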

-- Lars



----- Original Message -----
From: Colin Kincaid Williams <di...@uw.edu>
To: user@hbase.apache.org
Cc: 
Sent: Friday, August 8, 2014 1:16 PM
Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Not in the hbase shell I have:

hbase version
14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
14/08/08 14:16:08 INFO util.VersionInfo: Subversion
file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3
-r Unknown
14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan 26
17:11:38 PST 2013






On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <yu...@gmail.com> wrote:

> Using simplified version of your command, I saw the following in shell
> output (you may have noticed as well):
>
> An argument ignored (unknown or overridden): BLOOMFILTER
> An argument ignored (unknown or overridden): VERSIONS
> 0 row(s) in 2.1110 seconds
>
> Cheers
>
>
> On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <di...@uw.edu>
> wrote:
>
> > I have discovered the error. I made the mistake regarding the compression
> > and the bloom filter. The new table doesn't have them enabled, and the
> old
> > does. However I'm wondering how I can create tables with splits and bf
> and
> > compression enabled. Shouldn't the following command return an error?
> >
> > hbase(main):001:0> create 'ADMd5','a',{
> >
> > hbase(main):002:1* BLOOMFILTER => 'ROW',
> > hbase(main):003:1* VERSIONS => '1',
> > hbase(main):004:1* COMPRESSION => 'SNAPPY',
> > hbase(main):005:1* MIN_VERSIONS => '0',
> > hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
> > hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
> > hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
> > hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
> > hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
> > hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
> > hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
> > hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
> > hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
> > hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
> > hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
> > hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
> > hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
> > hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
> > hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
> > 0 row(s) in 1.8010 seconds
> >
> > hbase(main):024:0> describe 'ADMd5'
> > DESCRIPTION                                        ENABLED
> >
> >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOO true
> >
> >  MFILTER => 'NONE', REPLICATION_SCOPE => '0', VERS
> >
> >  IONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS
> >
> >  => '0', TTL => '2147483647', BLOCKSIZE => '65536'
> >
> >  , IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
> >
> > 1 row(s) in 0.0420 seconds
> >
> >
> >
> > On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org
> > > wrote:
> >
> > > Hi Colin,
> > >
> > > Just to make sure.
> > >
> > > Is table A from the source cluster and not compressed, and table B in
> the
> > > destination cluster and SNAPPY compressed? Is that correct? Then ratio
> > > should be the opposite. Are you able to du -h from hadoop to see if all
> > > regions are evenly bigger or if anything else is wrong?
> > >
> > >
> > > 2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <di...@uw.edu>:
> > >
> > > > I haven't yet tried to major compact table B. I will look up some
> > > > documentation on WALs and snapshots to find this information in the
> > hdfs
> > > > filesystem tomorrow. Could it be caused by the bloomfilter existing
> on
> > > > table B, but not table A? The funny thing is the source table is
> > smaller
> > > > than the destination.
> > > >
> > > >
> > > > On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <
> > esteban@cloudera.com>
> > > > wrote:
> > > >
> > > > > Hi Colin,
> > > > >
> > > > > Have you verified if the content of /a_d includes WALs and/or the
> > > content
> > > > > of the snapshots or the HBase archive? have you tried to major
> > compact
> > > > > table B?  does it make any difference?
> > > > >
> > > > > regards,
> > > > > esteban.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Cloudera, Inc.
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams <
> > discord@uw.edu
> > > >
> > > > > wrote:
> > > > >
> > > > > > I used the copy table command to copy a database between the
> > original
> > > > > > cluster A and a new cluster B. I have noticed that the rootdir is
> > > > larger
> > > > > > than 2X the size of the original. I am trying to account for
> such a
> > > > large
> > > > > > difference. The following are some details about the table.
> > > > > >
> > > > > >
> > > > > > I'm trying to figure out why my copied table is more than 2X the
> > size
> > > > of
> > > > > > the original table. Could the bloomfilter itself account for
> this?
> > > > > >
> > > > > > The guide I used as a reference:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
> > > > > >
> > > > > >
> > > > > >
> > > > > > Supposedly the original command used to create the table on
> cluster
> > > A:
> > > > > >
> > > > > > create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW', VERSIONS =>
> > '1',
> > > > > > COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}
> > > > > >
> > > > > >
> > > > > > How I created the target table on cluster B:
> > > > > >
> > > > > > create 'ADMd5','a',{
> > > > > >
> > > > > >
> > > > > >
> > > > > > BLOOMFILTER => 'ROW',
> > > > > > VERSIONS => '1',
> > > > > > COMPRESSION => 'SNAPPY',
> > > > > > MIN_VERSIONS => '0',
> > > > > > SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
> > > > > > '/zyuFR1VmhJyF4rbWsFnEg==',
> > > > > > '0sZYnBd83ul58d1O8I2JnA==',
> > > > > > '2+03N7IicZH3ltrqZUX6kQ==',
> > > > > > '4+/slRQtkBDU7Px6C9MAbg==',
> > > > > > '6+1dGCQ/IBrCsrNQXe/9xQ==',
> > > > > > '7+2pvtpHUQHWkZJoouR9wQ==',
> > > > > > '8+4n2deXhzmrpe//2Fo6Fg==',
> > > > > > '9+4SKW/BmNzpL68cXwKV1Q==',
> > > > > > 'A+4ajStFkjEMf36cX5D9xg==',
> > > > > > 'B+6Zm6Kccb3l6iM2L0epxQ==',
> > > > > > 'C+6lKKDiOWl5qrRn72fNCw==',
> > > > > > 'D+6dZMyn7m+NhJ7G07gqaw==',
> > > > > > 'E+6BrimmrpAd92gZJ5hyMw==',
> > > > > > 'G+5tisu4xWZMOJnDHeYBJg==',
> > > > > > 'I+7fRy4dvqcM/L6dFRQk9g==',
> > > > > > 'J+8ECMw1zeOyjfOg/ypXJA==',
> > > > > > 'K+7tenLYn6a1aNLniL6tbg==']}
> > > > > >
> > > > > >
> > > > > > How the tables now appear in hbase shell:
> > > > > >
> > > > > > table A:
> > > > > >
> > > > > > describe 'ADMd5'
> > > > > > DESCRIPTION
> > > > > >
> > > > > >   ENABLED
> > > > > >
> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
> > 'NONE',
> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
> > > > MIN_VER
> > > > > > true
> > > > > >
> > > > > >  SIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY
> > > =>
> > > > > > 'false', BLOCKCACHE => 'true'}]}
> > > > > >
> > > > > >
> > > > > > 1 row(s) in 0.0370 seconds
> > > > > >
> > > > > >
> > > > > > table B:
> > > > > >
> > > > > > hbase(main):003:0> describe 'ADMd5'
> > > > > > DESCRIPTION
> > > > > >
> > > > > >   ENABLED
> > > > > >
> > > > > >  {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER =>
> 'ROW',
> > > > > > REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>
> 'SNAPPY',
> > > > > MIN_VE
> > > > > > true
> > > > > >
> > > > > >  RSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> > IN_MEMORY
> > > =>
> > > > > > 'false', BLOCKCACHE => 'true'}]}
> > > > > >
> > > > > >
> > > > > > 1 row(s) in 0.0280 seconds
> > > > > >
> > > > > >
> > > > > >
> > > > > > The containing foldersize in hdfs:
> > > > > > table A:
> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > > > > 227.4g  /a_d
> > > > > >
> > > > > > table B:
> > > > > > sudo -u hdfs hadoop fs -dus -h /a_d
> > > > > > dus: DEPRECATED: Please use 'du -s' instead.
> > > > > > 501.0g  /a_d
> > > > > >
> > > > > >
> > > > > > https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Colin Kincaid Williams <di...@uw.edu>.
I don't see those "argument ignored" warnings in the hbase shell version I have:

hbase version
14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
14/08/08 14:16:08 INFO util.VersionInfo: Subversion
file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3
-r Unknown
14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan 26
17:11:38 PST 2013




Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Ted Yu <yu...@gmail.com>.
Using a simplified version of your command, I saw the following in the shell
output (you may have noticed it as well):

An argument ignored (unknown or overridden): BLOOMFILTER
An argument ignored (unknown or overridden): VERSIONS
0 row(s) in 2.1110 seconds
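
The reason there is no error: a dict without a NAME key is treated by the
shell as table-scope options rather than as a column family descriptor, so
per-family attributes placed there are dropped (with the warning above, on
shells that print it) and the family falls back to defaults. A minimal
sketch of the intended shape, using a hypothetical table 't1' and family
'f1' just to show the syntax:

create 't1',
  {NAME => 'f1', BLOOMFILTER => 'ROW', VERSIONS => '1',
   COMPRESSION => 'SNAPPY'},
  {SPLITS => ['b', 'm', 't']}

The per-family attributes need to sit in the same dict as NAME, while
SPLITS stays in its own table-scope dict.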

Cheers



Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Colin Kincaid Williams <di...@uw.edu>.
I have discovered the error. I made a mistake with the compression and the
bloom filter: the new table doesn't have them enabled, and the old one does.
However, I'm wondering how I can create tables with splits, a bloom filter,
and compression all enabled. Shouldn't the following command return an error?

hbase(main):001:0> create 'ADMd5','a',{

hbase(main):002:1* BLOOMFILTER => 'ROW',
hbase(main):003:1* VERSIONS => '1',
hbase(main):004:1* COMPRESSION => 'SNAPPY',
hbase(main):005:1* MIN_VERSIONS => '0',
hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
0 row(s) in 1.8010 seconds

hbase(main):024:0> describe 'ADMd5'
DESCRIPTION                                        ENABLED

 {NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOO true

 MFILTER => 'NONE', REPLICATION_SCOPE => '0', VERS

 IONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS

 => '0', TTL => '2147483647', BLOCKSIZE => '65536'

 , IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

1 row(s) in 0.0420 seconds




Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by tobe <to...@gmail.com>.
I couldn't reproduce this problem when I ran CopyTable. You could just run
"-du" to see the sizes of all the files.
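
For example, assuming /a_d is the HBase rootdir and the table directory
sits directly under it (the 0.92-era layout):

sudo -u hdfs hadoop fs -du -h /a_d/ADMd5

That prints one line per region directory, which makes it easy to see
whether all the regions grew evenly or only a few did.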



Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Colin,

Just to make sure.

Is table A on the source cluster and not compressed, and table B on the
destination cluster and SNAPPY compressed? Is that correct? Then the ratio
should be the opposite. Are you able to run du -h from hadoop to see whether
all the regions are evenly bigger, or whether anything else is wrong?



Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Colin Kincaid Williams <di...@uw.edu>.
I haven't yet tried to major compact table B. I will look up some
documentation on WALs and snapshots to find this information in the HDFS
filesystem tomorrow. Could it be caused by the bloom filter existing on
table B but not on table A? The funny thing is that the source table is
smaller than the destination.
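
(For scale, assuming the default ~1% bloom false-positive rate of roughly
10 bits per key: even a billion rows would put a ROW bloom at about
10^9 * 10 / 8 bytes = ~1.25 GB, nowhere near the ~274 GB gap between
227.4g and 501.0g, so the bloom filter alone seems unlikely to explain it.)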



Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hi Colin,

Have you verified whether the content of /a_d includes WALs and/or the
content of the snapshots or the HBase archive? Have you tried to major
compact table B? Does it make any difference?
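
A quick way to check, assuming the default layout under the rootdir
(directory names vary by version; e.g. 0.92 has .logs and .oldlogs but no
snapshot dirs):

hadoop fs -du -s -h /a_d/.logs /a_d/.oldlogs

and, from the hbase shell:

major_compact 'ADMd5'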

regards,
esteban.



--
Cloudera, Inc.


