Posted to user@hive.apache.org by vlisovsky <vl...@gmail.com> on 2010/12/10 07:46:42 UTC

Re: Hive HBase integration scan failing

> Hi guys,
> I wonder if anybody could shed some light on how to reduce the load on the HBase
> cluster when running a full scan.
> The need is to dump everything I have in HBase into a Hive table. The
> HBase data size is around 500g.
> The job creates 9000 mappers; after about 1000 maps, things go south every
> time.
> If I run the insert below, it runs for about 30 minutes and then starts bringing
> down the HBase cluster, after which the region servers need to be restarted.
> I wonder if there is a way to throttle it somehow, or whether there is
> any other method of getting structured data out.
> Any help is appreciated.
> Thanks,
> -Vitaly
>
> create external table hbase_linked_table (
>   mykey    string,
>   info     map<string, string>
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH
> SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:")
> TBLPROPERTIES ("hbase.table.name" = "hbase_table2");
>
> set hive.exec.compress.output=true;
> set io.seqfile.compression.type=BLOCK;
> set mapred.output.compression.type=BLOCK;
> set
> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>
> set mapred.reduce.tasks=40;
> set mapred.map.tasks=25;
>
> INSERT overwrite table tmp_hive_destination
> select * from hbase_linked_table;
>
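
A side note on the mapping above: because "hbase.columns.mapping" = ":key,info:" maps the entire info: column family into a single Hive map<string, string>, every qualifier in the family is fetched for every row, so the scan has to move the full ~500g. When only a handful of qualifiers are actually needed downstream, mapping them individually is one way to narrow what each scan returns. The qualifier names info:name and info:city and the table name hbase_linked_narrow below are made up for illustration; this does not help when a complete dump is required, as in the question above, but it keeps partial extracts from pulling more data than they need:

-- hypothetical narrower mapping: info:name and info:city are example qualifiers
create external table hbase_linked_narrow (
  mykey    string,
  name     string,
  city     string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:city")
TBLPROPERTIES ("hbase.table.name" = "hbase_table2");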

Re: Hive HBase integration scan failing

Posted by John Sichi <js...@fb.com>.
It's supposed to happen automatically.  The JIRA issue below mentions one case where it wasn't, and explains how I detected it and worked around it.  To make sure you're getting locality, look at the task tracker and make sure that, for your map tasks, the host used for executing each task matches the input split location.

JVS

On Dec 10, 2010, at 10:10 AM, vlisovsky wrote:

> Thanks for the info. Also, how can we make sure that our region servers are running on the same nodes as the datanodes (locality)? Is there a way we can verify this? 
> 
> On Thu, Dec 9, 2010 at 11:09 PM, John Sichi <js...@fb.com> wrote:
> Try
> 
> set hbase.client.scanner.caching=5000;
> 
> Also, check to make sure that you are getting the expected locality so that mappers are running on the same nodes as the region servers they are scanning (assuming that you are running HBase and mapreduce on the same cluster).  When I was testing this, I encountered this problem (but it may have been specific to our cluster configurations):
> 
> https://issues.apache.org/jira/browse/HBASE-2535
> 
> JVS
> 
> On Dec 9, 2010, at 10:46 PM, vlisovsky wrote:
> 
> >
> > Hi guys,
> > I wonder if anybody could shed some light on how to reduce the load on the HBase cluster when running a full scan.
> > The need is to dump everything I have in HBase into a Hive table. The HBase data size is around 500g.
> > The job creates 9000 mappers; after about 1000 maps, things go south every time.
> > If I run the insert below, it runs for about 30 minutes and then starts bringing down the HBase cluster, after which the region servers need to be restarted.
> > I wonder if there is a way to throttle it somehow, or whether there is any other method of getting structured data out.
> > Any help is appreciated.
> > Thanks,
> > -Vitaly
> >
> > create external table hbase_linked_table (
> >   mykey    string,
> >   info     map<string, string>
> > )
> > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> > WITH
> > SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:")
> > TBLPROPERTIES ("hbase.table.name" = "hbase_table2");
> >
> > set hive.exec.compress.output=true;
> > set io.seqfile.compression.type=BLOCK;
> > set mapred.output.compression.type=BLOCK;
> > set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> >
> > set mapred.reduce.tasks=40;
> > set mapred.map.tasks=25;
> >
> > INSERT overwrite table tmp_hive_destination
> > select * from hbase_linked_table;
> >
> 
> 


Re: Hive HBase integration scan failing

Posted by vlisovsky <vl...@gmail.com>.
Thanks for the info. Also, how can we make sure that our region servers are
running on the same nodes as the datanodes (locality)? Is there a way we can
verify this?

On Thu, Dec 9, 2010 at 11:09 PM, John Sichi <js...@fb.com> wrote:

> Try
>
> set hbase.client.scanner.caching=5000;
>
> Also, check to make sure that you are getting the expected locality so that
> mappers are running on the same nodes as the region servers they are
> scanning (assuming that you are running HBase and mapreduce on the same
> cluster).  When I was testing this, I encountered this problem (but it may
> have been specific to our cluster configurations):
>
> https://issues.apache.org/jira/browse/HBASE-2535
>
> JVS
>
> On Dec 9, 2010, at 10:46 PM, vlisovsky wrote:
>
> >
> > Hi guys,
> > I wonder if anybody could shed some light on how to reduce the load on the HBase cluster when running a full scan.
> > The need is to dump everything I have in HBase into a Hive table. The HBase data size is around 500g.
> > The job creates 9000 mappers; after about 1000 maps, things go south every time.
> > If I run the insert below, it runs for about 30 minutes and then starts bringing down the HBase cluster, after which the region servers need to be restarted.
> > I wonder if there is a way to throttle it somehow, or whether there is any other method of getting structured data out.
> > Any help is appreciated.
> > Thanks,
> > -Vitaly
> >
> > create external table hbase_linked_table (
> >   mykey    string,
> >   info     map<string, string>
> > )
> > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> > WITH
> > SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:")
> > TBLPROPERTIES ("hbase.table.name" = "hbase_table2");
> >
> > set hive.exec.compress.output=true;
> > set io.seqfile.compression.type=BLOCK;
> > set mapred.output.compression.type=BLOCK;
> > set
> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> >
> > set mapred.reduce.tasks=40;
> > set mapred.map.tasks=25;
> >
> > INSERT overwrite table tmp_hive_destination
> > select * from hbase_linked_table;
> >
>
>

Re: Hive HBase integration scan failing

Posted by John Sichi <js...@fb.com>.
Try

set hbase.client.scanner.caching=5000;

Also, check to make sure that you are getting the expected locality so that mappers are running on the same nodes as the region servers they are scanning (assuming that you are running HBase and mapreduce on the same cluster).  When I was testing this, I encountered this problem (but it may have been specific to our cluster configurations):

https://issues.apache.org/jira/browse/HBASE-2535

JVS
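
A sketch of how the suggestion above might be applied to the original job: the caching value of 5000 comes from this reply, while the compression settings, destination table, and INSERT are taken from the original post quoted below. Tune the caching value to whatever your region servers can sustain, since larger values mean fewer but heavier scanner RPCs:

-- sketch: combine the scanner-caching suggestion with the original job's settings
-- rows fetched per scanner RPC from each region server
set hbase.client.scanner.caching=5000;

-- output compression, as in the original job
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- full-table export into the Hive destination table from the original post
INSERT OVERWRITE TABLE tmp_hive_destination
SELECT * FROM hbase_linked_table;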

On Dec 9, 2010, at 10:46 PM, vlisovsky wrote:

> 
> Hi guys,
> I wonder if anybody could shed some light on how to reduce the load on the HBase cluster when running a full scan.
> The need is to dump everything I have in HBase into a Hive table. The HBase data size is around 500g.
> The job creates 9000 mappers; after about 1000 maps, things go south every time.
> If I run the insert below, it runs for about 30 minutes and then starts bringing down the HBase cluster, after which the region servers need to be restarted.
> I wonder if there is a way to throttle it somehow, or whether there is any other method of getting structured data out.
> Any help is appreciated.
> Thanks,
> -Vitaly
> 
> create external table hbase_linked_table (
>   mykey    string,
>   info     map<string, string>
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH 
> SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:")
> TBLPROPERTIES ("hbase.table.name" = "hbase_table2");
> 
> set hive.exec.compress.output=true;
> set io.seqfile.compression.type=BLOCK;
> set mapred.output.compression.type=BLOCK;
> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> 
> set mapred.reduce.tasks=40;
> set mapred.map.tasks=25;
> 
> INSERT overwrite table tmp_hive_destination
> select * from hbase_linked_table;
>