Posted to user@hbase.apache.org by Naveen <na...@cleartrip.com> on 2012/09/24 17:01:01 UTC

Mass dumping of data has issues

Hi,

I've come across the following issue, and I'm unable to work out what the
root cause could be.

Scenario:
I'm trying to dump data (8.3M+ records) from MySQL into an HBase table
using multi-threading (25 threads, each dumping 10 puts/tuples at a time).
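
Each worker thread does roughly the following (a simplified sketch, not the
actual code -- the column family/qualifier names and the pre-fetched row
pairs are illustrative):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.HTablePool;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class Dumper implements Runnable {
        private static final Configuration CONF = HBaseConfiguration.create();
        private static final HTablePool POOL = new HTablePool(CONF, 25);

        private final List<String[]> rows; // {primaryKey, value} pairs read from MySQL

        public Dumper(List<String[]> rows) {
            this.rows = rows;
        }

        public void run() {
            HTableInterface table = POOL.getTable("dump");
            try {
                List<Put> batch = new ArrayList<Put>(rows.size());
                for (String[] row : rows) {
                    Put put = new Put(Bytes.toBytes(row[0])); // MySQL PK as rowkey
                    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"),
                            Bytes.toBytes(row[1]));
                    batch.add(put);
                }
                table.put(batch); // one batched write per 10 rows
            } catch (IOException e) {
                throw new RuntimeException(e); // rethrow for this sketch
            } finally {
                POOL.putTable(table); // hand the instance back to the pool
            }
        }
    }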

Config:
HBase v0.92.0
Hadoop v1.0
1 master + 4 slaves
table is pre-split

Issue:
Getting an NPE because the RPC call takes longer than the timeout (default
60 sec). I'm not worried about the NPE itself (it's been fixed in the
0.92.1+ releases), but about what could be causing RPC calls to time out at
arbitrary intervals.

Custom printed log: pastebin.com/r85wv8Yt

WARN [Thread-99255] (HConnectionManager.java:1587) - Failed all from
region=dump,a405cdd9-b5b7-4ec2-9f91-fea98d5cb656,1348331511473.77f13d455fd63c601816759b6ed575e8., hostname=hdslave1.company.com, port=60020
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.NullPointerException
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1557)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409)
	at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:900)
	at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:777)
	at org.apache.hadoop.hbase.client.HTable.put(HTable.java:760)
	at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.put(HTablePool.java:402)
	at coprocessor.dump.Dumper.run(Dumper.java:41)
	at java.lang.Thread.run(Thread.java:662)
Any help or insights are welcome.

Warm Regards,
Naveen 


RE: Mass dumping of data has issues

Posted by Naveen <na...@cleartrip.com>.
Thank you for the quick response, Paul. I've switched off autoFlush
(though I haven't increased the write buffer). And the splits seem pretty
effective, judging by the similar number of requests each region gets
before splitting further.
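
Concretely, the change amounts to this (a sketch, not my exact code; the
commented-out buffer size is an example value):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class BufferedTable {
        public static HTable open() throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "dump");
            table.setAutoFlush(false); // buffer puts client-side instead of one RPC per put()
            // table.setWriteBufferSize(8 * 1024 * 1024); // e.g. 8 MB -- the part I skipped
            return table;
        }
    }

One thing to keep in mind with autoFlush off: flushCommits() (or close())
has to be called at the end, or whatever is still sitting in the write
buffer never gets sent.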

As per your suggestion, I tried the same task two more times with an
increased handler count (15 and 20) for the regionservers, and the results,
though not ideal, are better than with the default number of handlers. With
15 handlers per regionserver the results were actually as bad (multiple
NPEs because of RPC call timeouts) as with the default value, but with 20
the NPEs reduced dramatically (still not gone, though). I think I'll try
again with 25 and report back with the results.

Also, I've seen some data loss in the process. I'm not sure it's because of
the RPC timeouts, since exactly the same number of rows went missing in
both attempts I made (once with the handler count at 10 and the other time
at 20), even though the number of NPEs varied drastically. Any tips on
where or what I should be focusing on to uncover the cause of the data
loss? (PS: I'm using the primary key of the MySQL table as my rowkey.)

Warm Regards,
Naveen

-----Original Message-----
From: Paul Mackles [mailto:pmackles@adobe.com] 
Sent: Monday, September 24, 2012 8:51 PM
To: user@hbase.apache.org
Subject: Re: Mass dumping of data has issues

Did you adjust the write buffer to a larger size and/or turn off autoFlush
for the HTable? I've found that both of those settings can have a profound
impact on write performance. You might also look at adjusting the handler
count for the regionservers, which by default is pretty low. You should
also confirm that your splits are effective in distributing the writes.

On 9/24/12 11:01 AM, "Naveen" <na...@cleartrip.com> wrote:

[...]



RE: Mass dumping of data has issues

Posted by Naveen <na...@cleartrip.com>.
Hi Ram,

I'm using HTablePool to get an instance for each thread. So, no, I'm not
sharing the same instance across multiple threads.

Regards,
Naveen

-----Original Message-----
From: Ramkrishna.S.Vasudevan [mailto:ramkrishna.vasudevan@huawei.com] 
Sent: Wednesday, September 26, 2012 12:15 PM
To: user@hbase.apache.org
Subject: RE: Mass dumping of data has issues

For the NPE that you got, is the same HTable instance shared by different
threads? This is a common problem users encounter when using HTable across
multiple threads.
Please check and ensure that the HTable is not shared.

Regards
Ram

[...]



RE: Mass dumping of data has issues

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.
For the NPE that you got, is the same HTable instance shared by different
threads? This is a common problem users encounter when using HTable across
multiple threads.
Please check and ensure that the HTable is not shared.
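
For example (illustrative only; assume conf, pool and puts are already set
up):

    // Risky: HTable is not thread-safe, so sharing one instance across
    // threads can corrupt its internal write buffer and cause NPEs.
    HTable shared = new HTable(conf, "dump"); // then used from many threads

    // Safe: each thread checks its own instance out of an HTablePool.
    HTableInterface table = pool.getTable("dump");
    try {
        table.put(puts);
    } finally {
        pool.putTable(table); // return it when done
    }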

Regards
Ram

[...]


RE: Mass dumping of data has issues

Posted by Naveen <na...@cleartrip.com>.
Hi Dan,

I'm actually trying to simulate the kind of load we're expecting on our
production servers (the intention is not to migrate data), hence a
self-written program rather than Sqoop. (PS: I have actually tried Sqoop.)

Warm regards,
Naveen

-----Original Message-----
From: Dan Han [mailto:dannahan2008@gmail.com] 
Sent: Wednesday, September 26, 2012 7:20 AM
To: user@hbase.apache.org
Subject: Re: Mass dumping of data has issues

Hi Naveen,

   There is a tool called Sqoop which supports importing data from a
relational database into HBase.
https://blogs.apache.org/sqoop/entry/apache_sqoop_graduates_from_incubator

Maybe it can help you migrate the data easily.

Best Wishes
Dan Han

On Mon, Sep 24, 2012 at 9:20 AM, Paul Mackles <pm...@adobe.com> wrote:

[...]


Re: Mass dumping of data has issues

Posted by Dan Han <da...@gmail.com>.
Hi Naveen,

   There is a tool called Sqoop which supports importing data from a
relational database into HBase.
https://blogs.apache.org/sqoop/entry/apache_sqoop_graduates_from_incubator

Maybe it can help you migrate the data easily.
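
A typical invocation looks something like the following (the connect
string, credentials, table names, rowkey column and column family are all
placeholders to adapt):

    sqoop import \
      --connect jdbc:mysql://dbhost/sourcedb \
      --username dbuser -P \
      --table source_table \
      --hbase-table dump \
      --column-family cf \
      --hbase-row-key id \
      --hbase-create-table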

Best Wishes
Dan Han

On Mon, Sep 24, 2012 at 9:20 AM, Paul Mackles <pm...@adobe.com> wrote:

[...]

Re: Mass dumping of data has issues

Posted by Paul Mackles <pm...@adobe.com>.
Did you adjust the write buffer to a larger size and/or turn off autoFlush
for the HTable? I've found that both of those settings can have a profound
impact on write performance. You might also look at adjusting the handler
count for the regionservers, which by default is pretty low. You should
also confirm that your splits are effective in distributing the writes.
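
For the handler count, the relevant property is
hbase.regionserver.handler.count in hbase-site.xml on each regionserver
(the value below is just an example; the regionservers need a restart to
pick it up):

    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>20</value>
    </property>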

On 9/24/12 11:01 AM, "Naveen" <na...@cleartrip.com> wrote:

[...]