Posted to user@hbase.apache.org by ap...@borkbork.net on 2015/05/24 17:43:48 UTC

Issues with import from 0.92 into 0.98

Hello all-

I'm hoping someone can point me in the right direction as I've exhausted
all my knowledge and abilities on the topic...

I've inherited an old, poorly configured and brittle CDH4 cluster
running HBase 0.92. I'm attempting to migrate the data to a new Ambari
cluster running HBase 0.98. I'm attempting to do this without changing
anything on the old cluster as I have a hard enough time keeping it
running as is. Also, due to configuration issues with the old cluster
(on AWS), a direct HBase to HBase table copy, or even HDFS to HDFS copy
is out of the question at the moment. 

I was able to use the export task on the old cluster to dump the HBase
tables to HDFS, which I then copied up to S3 with distcp (over s3n) and
back down to the new cluster, and then ran the HBase importer. This
appears to work fine...

... except that on the new cluster table scans with column filters do
not work. 

A sample row looks something like this:
A:9223370612274019807:twtr:56935907581904486 column=x:twitter:username,
timestamp=1424592575087, value=Bilo Selhi

Unfortunately, even though I can see the column is properly defined, I
cannot filter on it:

hbase(main):015:0> scan 'content' , {LIMIT=>10,
COLUMNS=>'x:twitter:username'}
ROW                           COLUMN+CELL                                
0 row(s) in 352.7990 seconds

Any ideas what the heck is going on here?

Here's the rough process I used for the export/import:
Old cluster:
$ hbase org.apache.hadoop.hbase.mapreduce.Driver export content
hdfs:///hbase_content 
$ hadoop distcp -Dfs.s3n.awsAccessKeyId='xxxx'
-Dfs.s3n.awsSecretAccessKey='xxxx' -i hdfs:///hbase_content
s3n://hbase_content

New cluster:
$ hadoop distcp -Dfs.s3n.awsAccessKeyId='xxxx'
-Dfs.s3n.awsSecretAccessKey='xxxx' -i s3n://hbase_content
hdfs:///hbase_content
$ hbase -Dhbase.import.version=0.94
org.apache.hadoop.hbase.mapreduce.Driver import content
hdfs:///hbase_content

Thanks!
Z

Re: Issues with import from 0.92 into 0.98

Posted by ap...@borkbork.net.
On Wed, May 27, 2015, at 11:41 AM, Nick Dimiduk wrote:
> Scanning without the column filter produces data?
> 
> The content table on the new cluster has the same column family names
> ('x',
> in your example above)?
> 

Yes, if I scan without a column filter (and I should probably try some
other filters at some point), the data is returned correctly. 

I dumped the table schema from 0.92 using the shell 'describe' command.
Since the 0.92 output uses deprecated syntax, I had to modify it before
recreating the table in 0.98.

0.92 output of 'describe':
{NAME => 'content', FAMILIES => [{NAME => 'x', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

I modified this for import into 0.98:
create 'content', {NAME => 'x', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
=> '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL
=> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}

Did I get the table schema wrong in 0.98?

Z

Re: Issues with import from 0.92 into 0.98

Posted by Nick Dimiduk <nd...@gmail.com>.
Scanning without the column filter produces data?

The content table on the new cluster has the same column family names ('x',
in your example above)?

On Wed, May 27, 2015 at 8:35 AM, Dave Latham <la...@davelink.net> wrote:

> Sounds like quite a puzzle.
>
> You mentioned that you can read data written through manual Puts from
> the shell - but not data from the Import.  There must be something
> different about the data itself once it's in the table.  Can you
> compare a row that was imported to a row that was manually written -
> or show them to us?
>
> On Wed, May 27, 2015 at 7:09 AM,  <ap...@borkbork.net> wrote:
> > So more experimentation over the long weekend on this.
> >
> > If I load sample data into the new cluster table manually through the
> > shell, column filters work as expected.
> >
> > Obviously not a solution to the problem. Anyone have any ideas or things
> > I should be looking at? The regionserver logs show nothing unusual.
> >
> > Is there another export/import chain I could try?
> >
> > Thanks,
> > Zack
> >
> >
> > On Sun, May 24, 2015, at 11:43 AM, apache@borkbork.net wrote:
> >> [original message of May 24 quoted in full; snipped]
>

Re: Issues with import from 0.92 into 0.98

Posted by Dave Latham <la...@davelink.net>.
On Wed, May 27, 2015 at 11:17 AM,  <ap...@borkbork.net> wrote:
> Thanks! I want to make sure I've got it right:
>
> When I import the 0.92 data into 0.98, the columns are defined properly
> in the 0.98 table, but I cannot perform a scan with a column filter in
> the shell as the shell interprets the second ':' in the column filter as
> a formatter. From the bug you opened (HBASE-13788), this formatter
> behavior is *only* present in the shell; programmatic queries do not
> have this filtering issue.

That's my best guess after poking at the shell code and reproducing your issue.

> Do you know if there are any temporary workarounds in the meantime or
> alternatives to the shell for HBASE-13788? Hue or Phoenix perhaps? I
> know the shell is heavily used by the engineering team to spot check
> stuff and to lose that capability is going to really throw them for a
> loop.

Good question and others familiar with those tools may be able to give
you a better answer - I haven't used them.

If you're happy without FORMATTER support in your shell, you could
remove the line in the shell that does that parsing.  If you look at
lib/ruby/hbase/table.rb, find def parse_column_name, and remove the
line "set_converter(split) if split.length > 1"; that should do it.
Here's the line in master as an example:
https://github.com/apache/hbase/blob/master/hbase-shell/src/main/ruby/hbase/table.rb#L633
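
Another thing that might be worth a try (untested on my side): the
FILTER string goes through a different parser than the COLUMNS list,
so a QualifierFilter comparing against the full qualifier bytes might
sidestep the column-name parsing entirely.  Something like:

  hbase> scan 'content', {LIMIT => 10, FILTER => "QualifierFilter (=, 'binary:twitter:username')"}

I haven't checked whether the filter parser tolerates the extra colon
inside the comparator value, so treat that as an experiment rather
than a fix.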

>
> Thanks again,
> Zack

Re: Issues with import from 0.92 into 0.98

Posted by ap...@borkbork.net.
On Wed, May 27, 2015, at 01:54 PM, Dave Latham wrote:
> It looks like the hbase shell (beginning with 0.96) parses column
> names as FAMILY:QUALIFIER[:FORMATTER] due to work from HBASE-6592.
> As a result, the shell basically doesn't support specifying any
> columns (for gets/puts/scans/etc) that include a colon in the
> qualifier.  I filed HBASE-13788.
> 
> For your case, I suspect the data was properly imported, but when you
> tried to scan for "x:twitter:username" it instead scanned for
> "x:twitter" and found nothing.
> 

Dave-

Thanks! I want to make sure I've got it right:

When I import the 0.92 data into 0.98, the columns are defined properly
in the 0.98 table, but I cannot perform a scan with a column filter in
the shell as the shell interprets the second ':' in the column filter as
a formatter. From the bug you opened (HBASE-13788), this formatter
behavior is *only* present in the shell; programmatic queries do not
have this filtering issue.

Do you know if there are any temporary workarounds in the meantime or
alternatives to the shell for HBASE-13788? Hue or Phoenix perhaps? I
know the shell is heavily used by the engineering team to spot check
stuff and to lose that capability is going to really throw them for a
loop.
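
One idea I had, completely untested, is to drop down to the Java client
API from inside the shell, since the shell is just JRuby and (if I
understand the bug right) the column-name parsing only happens in the
shell's own helpers.  Roughly:

  conf   = org.apache.hadoop.hbase.HBaseConfiguration.create
  htable = org.apache.hadoop.hbase.client.HTable.new(conf, 'content')
  s = org.apache.hadoop.hbase.client.Scan.new
  s.addColumn('x'.to_java_bytes, 'twitter:username'.to_java_bytes)
  scanner = htable.getScanner(s)
  10.times { r = scanner.next; break if r.nil?; puts r }
  scanner.close
  htable.close

No idea yet whether that's practical for the team's day-to-day spot
checks, though.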

Thanks again,
Zack

Re: Issues with import from 0.92 into 0.98

Posted by Dave Latham <la...@davelink.net>.
It looks like the hbase shell (beginning with 0.96) parses column
names as FAMILY:QUALIFIER[:FORMATTER] due to work from HBASE-6592.
As a result, the shell basically doesn't support specifying any
columns (for gets/puts/scans/etc) that include a colon in the
qualifier.  I filed HBASE-13788.

For your case, I suspect the data was properly imported, but when you
tried to scan for "x:twitter:username" it instead scanned for
"x:twitter" and found nothing.

Dave

P.S. Here's some related help text from the shell.

Besides the default 'toStringBinary' format, 'get' also supports custom
formatting by column.  A user can define a FORMATTER by adding it to the
column name in the get specification.  The FORMATTER can be stipulated:

 1. either as a org.apache.hadoop.hbase.util.Bytes method name (e.g.,
    toInt, toString)
 2. or as a custom class followed by method name: e.g.
    'c(MyFormatterClass).format'.

Example formatting cf:qualifier1 and cf:qualifier2 both as Integers:
  hbase> get 't1', 'r1', {COLUMN => ['cf:qualifier1:toInt',
    'cf:qualifier2:c(org.apache.hadoop.hbase.util.Bytes).toInt'] }

Note that you can specify a FORMATTER by column only (cf:qualifier).
You cannot specify a FORMATTER for all columns of a column family.

On Wed, May 27, 2015 at 10:23 AM,  <ap...@borkbork.net> wrote:
> On Wed, May 27, 2015, at 11:35 AM, Dave Latham wrote:
>> Sounds like quite a puzzle.
>>
>> You mentioned that you can read data written through manual Puts from
>> the shell - but not data from the Import.  There must be something
>> different about the data itself once it's in the table.  Can you
>> compare a row that was imported to a row that was manually written -
>> or show them to us?
>
> Hmph, I may have spoken too soon. I know I tested this at one point and
> it worked, but now I'm getting different results:
>
> On the new cluster, I created a duplicate test table:
> hbase(main):043:0> create 'content3', {NAME => 'x', BLOOMFILTER =>
> 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION =>
> 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}
>
> Then I pull some data from the imported table:
> hbase(main):045:0> scan 'content', {LIMIT=>1,
> STARTROW=>'A:9223370612089311807:twtr:57013379'}
> ROW                                  COLUMN+CELL
> ....
> A:9223370612089311807:twtr:570133798827921408
> column=x:twitter:username, timestamp=1424775595345, value=BERITA &
> INFORMASI!
>
> Then put it:
> hbase(main):046:0> put
> 'content3','A:9223370612089311807:twtr:570133798827921408',
> 'x:twitter:username', 'BERITA & INFORMASI!'
>
> But then when I query it, I see that I've lost the column qualifier
> ":username":
> hbase(main):046:0> scan 'content3'
> ROW                                  COLUMN+CELL
>  A:9223370612089311807:twtr:570133798827921408 column=x:twitter,
>  timestamp=1432745301788, value=BERITA & INFORMASI!
>
> Even though I'm missing one of the qualifiers, I can at least filter on
> columns in this sample table.
>
> So now I'm even more baffled :(
>
> Z
>

Re: Issues with import from 0.92 into 0.98

Posted by ap...@borkbork.net.
On Wed, May 27, 2015, at 11:35 AM, Dave Latham wrote:
> Sounds like quite a puzzle.
> 
> You mentioned that you can read data written through manual Puts from
> the shell - but not data from the Import.  There must be something
> different about the data itself once it's in the table.  Can you
> compare a row that was imported to a row that was manually written -
> or show them to us?

Hmph, I may have spoken too soon. I know I tested this at one point and
it worked, but now I'm getting different results:

On the new cluster, I created a duplicate test table:
hbase(main):043:0> create 'content3', {NAME => 'x', BLOOMFILTER =>
'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION =>
'NONE', MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}

Then I pull some data from the imported table:
hbase(main):045:0> scan 'content', {LIMIT=>1,
STARTROW=>'A:9223370612089311807:twtr:57013379'}
ROW                                  COLUMN+CELL                         
....
A:9223370612089311807:twtr:570133798827921408 
column=x:twitter:username, timestamp=1424775595345, value=BERITA &
INFORMASI!                            
 
Then put it:
hbase(main):046:0> put
'content3','A:9223370612089311807:twtr:570133798827921408',
'x:twitter:username', 'BERITA & INFORMASI!'

But then when I query it, I see that I've lost the column qualifier
":username":
hbase(main):046:0> scan 'content3'
ROW                                  COLUMN+CELL                         
 A:9223370612089311807:twtr:570133798827921408 column=x:twitter,
 timestamp=1432745301788, value=BERITA & INFORMASI!                      
 
Even though I'm missing one of the qualifiers, I can at least filter on
columns in this sample table.

So now I'm even more baffled :(

Z


Re: Issues with import from 0.92 into 0.98

Posted by Dave Latham <la...@davelink.net>.
Sounds like quite a puzzle.

You mentioned that you can read data written through manual Puts from
the shell - but not data from the Import.  There must be something
different about the data itself once it's in the table.  Can you
compare a row that was imported to a row that was manually written -
or show them to us?

On Wed, May 27, 2015 at 7:09 AM,  <ap...@borkbork.net> wrote:
> So more experimentation over the long weekend on this.
>
> If I load sample data into the new cluster table manually through the
> shell, column filters work as expected.
>
> Obviously not a solution to the problem. Anyone have any ideas or things
> I should be looking at? The regionserver logs show nothing unusual.
>
> Is there another export/import chain I could try?
>
> Thanks,
> Zack
>
>
> On Sun, May 24, 2015, at 11:43 AM, apache@borkbork.net wrote:
>> [original message of May 24 quoted in full; snipped]

Re: Issues with import from 0.92 into 0.98

Posted by ap...@borkbork.net.
So more experimentation over the long weekend on this.

If I load sample data into the new cluster table manually through the
shell, column filters work as expected. 

Obviously not a solution to the problem. Anyone have any ideas or things
I should be looking at? The regionserver logs show nothing unusual.

Is there another export/import chain I could try? 

Thanks,
Zack


On Sun, May 24, 2015, at 11:43 AM, apache@borkbork.net wrote:
> [original message of May 24 quoted in full; snipped]