You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Daniel Pruckler <Da...@t8webware.com> on 2010/03/18 17:50:22 UTC

Problem when joining hbase tables in hive

We're just starting with the HBaseIntegration, and are really excited about it, great work on this.  Thanks

 

We are running into a problem when joining two hbase tables.  It seems to usually work when joining one hbase table with one hive table, but not always.  Let me know if I'm doing something wrong.

 

My workaround for right now is to create new tables in hive and overwrite that table with the external hbase table, and then join from the new tables.  In our situation this is working as the one large table we've got seems to work just fine in hbase, we just need to overwrite the smaller join tables before doing a join.  I have one join example at the end of this email that fails when joining an hbase and hive table.

 

 

 

 

Here's what I did:

 

Create the HBase tables:

 

create 'users', 'info'

put 'users', 'user1', 'info:state', 'IA'

put 'users', 'user1', 'info:country', 'USA'

put 'users', 'user1', 'info:country_id', '0'

 

create 'states', 'state'

put 'states', 'IA', 'state:name', 'Iowa'

 

create 'countries', 'info'

put 'countries', 'USA', 'info:name', 'United States'

put 'countries', 'USA', 'info:country', 'USA'

put 'countries', 'USA', 'info:country_id', '1'

 

 

 

Create the hive tables:

 

CREATE EXTERNAL TABLE users(key string, state string, country string, country_id int) 

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES (

"hbase.key.type" = "string",

"hbase.columns.mapping" = "info:state,info:country,info:country_id",

"hbase.table.name" = "users"

);

 

CREATE EXTERNAL TABLE states(key string, name string) 

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES (

"hbase.key.type" = "string",

"hbase.columns.mapping" = "state:name",

"hbase.table.name" = "states"

);

 

CREATE EXTERNAL TABLE countries(key string, name string, country string, country_id int) 

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES (

"hbase.key.type" = "string",

"hbase.columns.mapping" = "info:name,info:country,info:country_id",

"hbase.table.name" = "countries"

);

 

 

Then try to join, this returns no results:

SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country = c.key);

OK

Time taken: 26.75 seconds

 

 

This returns results, but not what's expected, it seems like it joins the user table with itself

SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country = c.country);

OK

user1     USA        NULL      user1

Time taken: 24.867 seconds

 

 

This also returns results (it should not as the country_ids do not match)

SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country_id = c.country_id);

OK

user1     USA        NULL      user1

Time taken: 22.868 seconds

 

 

 

 

I would guess this is part of the same issue, trying to join anything in state with user results in an error (state does not have an 'info' column family)

hive> SELECT u.key, u.state, s.name FROM users u JOIN states s ON (u.state = s.key);

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapred.reduce.tasks=<number>

Starting Job = job_201003150937_0096, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201003150937_0096

Kill Command = /usr/lib/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201003150937_0096

2010-03-18 10:43:28,730 Stage-1 map = 0%,  reduce = 0%

2010-03-18 10:43:34,908 Stage-1 map = 50%,  reduce = 0%

2010-03-18 10:43:44,020 Stage-1 map = 50%,  reduce = 17%

2010-03-18 10:44:02,146 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_201003150937_0096 with errors

 

Task with the most failures(4): 

-----

Task ID:

  task_201003150937_0096_m_000001

 

URL:

  http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0096&tipid=task_201003150937_0096_m_000001

-----

 

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

 

org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family state does not exist in region users,,1268926579188 in table {NAME => 'users', FAMILIES => [{NAME => 'info', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

                at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2476)

                at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1107)

                at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1095)

                at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1846)

                at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)

                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

                at java.lang.reflect.Method.invoke(Method.java:597)

                at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)

                at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

 

                at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

                at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

                at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)

                at java.lang.reflect.Constructor.newInstance(Constructor.java:513)

                at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)

                at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1041)

                at org.apache.hadoop.hbase.client.HTable$ClientScanner.nextScanner(HTable.java:1935)

                at org.apache.hadoop.hbase.client.HTable$ClientScanner.initialize(HTable.java:1855)

                at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:376)

                at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.restart(TableInputFormatBase.java:129)

                at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.init(TableInputFormatBase.java:139)

                at org.apache.hadoop.hbase.mapred.TableInputFormatBase.getRecordReader(TableInputFormatBase.java:260)

                at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:119)

                at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229)

                at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)

                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

                at org.apache.hadoop.mapred.Child.main(Child.java:170)

 

 

 

Here's my workaround that allows me to be able to join.

 

CREATE TABLE hive_users(key string, state string, country string, country_id int);

INSERT OVERWRITE TABLE hive_users SELECT key, state, country, country_id FROM users;

 

CREATE TABLE hive_states(key string, name string);

INSERT OVERWRITE TABLE hive_states SELECT key, name FROM states;

 

CREATE TABLE hive_countries(key string, name string, country string, country_id int);

INSERT OVERWRITE TABLE hive_countries SELECT key, name, country, country_id FROM countries;

 

 

Any of my selects work just fine:

SELECT u.key, u.state, s.name FROM hive_users u JOIN hive_states s ON (u.state = s.key);

SELECT u.key, u.country, c.name FROM hive_users u JOIN hive_countries c ON (u.country = c.key);

 

 

 

When joining one hbase table it will work with some tables, and give errors with others, I'm not seeing anything I'm doing wrong with the select.

 

This works:

SELECT u.key, u.country, c.name FROM users u JOIN hive_countries c ON (u.country = c.key);

 

However while trying other things, I found that this query does not work, and cannot figure out why it'd be any different, maybe I'm doing something wrong.

SELECT u.key, u.country, c.name FROM hive_users u JOIN countries c ON (u.country = c.key);

 

Ended Job = job_201003150937_0118 with errors

 

Task with the most failures(4): 

-----

Task ID:

  task_201003150937_0118_m_000001

 

URL:

  http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0118&tipid=task_201003150937_0118_m_000001

-----

 

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

 

 

java.lang.IllegalArgumentException: string cannot be null

                at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:390)

                at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:68)

                at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229)

                at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)

                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

                at org.apache.hadoop.mapred.Child.main(Child.java:170)

 

 

 

 

 

Thanks,

Daniel

RE: Problem when joining hbase tables in hive

Posted by John Sichi <js...@facebook.com>.

Awesome.  Namit just committed the HIVE-1257 patch to trunk.

He also committed my HIVE-1220 patch, which allows TBLPROPERTIES to be specified on CREATE TABLE (not just ALTER TABLE).  This means that for non-native tables, properties unrelated to serde should now be specified via TBLPROPERTIES rather than SERDEPROPERTIES.  I'm going to update the examples on the wiki accordingly, e.g.

CREATE TABLE hbase_table_1(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");

So once you pick up the HIVE-1220 commit, start using the new syntax.

Note that TBLPROPERTIES can be set on any table for your own metadata purposes, not just non-native tables.  And for good measure, I added support for them in CREATE/ALTER VIEW as well.

JVS
________________________________________
From: Daniel Pruckler [Daniel.Pruckler@t8webware.com]
Sent: Monday, March 22, 2010 1:32 PM
To: hive-user@hadoop.apache.org
Subject: RE: Problem when joining hbase tables in hive

John,

I wanted to post back and say thanks again, I just downloaded the patch you posted and hbase joins are working great.

Thank you,
Daniel

RE: Problem when joining hbase tables in hive

Posted by Daniel Pruckler <Da...@t8webware.com>.

John,

 

I wanted to post back and say thanks again, I just downloaded the patch you posted and hbase joins are working great.

 

Thank you,

Daniel

 

 

From: John Sichi [mailto:jsichi@facebook.com] 
Sent: Thursday, March 18, 2010 6:02 PM
To: hive-user@hadoop.apache.org
Subject: Re: Problem when joining hbase tables in hive 

 

Hi Daniel,

 

Thank you so much for the very detailed test case.  It's a great example of excellent, easy-to-reproduce bug reporting.

 

The problem shows up for me too; I think it's due to an issue mentioned by Namit while he was reviewing the patch.  The reason it doesn't show up in the unit test is that the unit test populates identical data in the Hive table and the HBase table before exercising the join, whoops.

 

I've created this issue for tracking purposes:

 

https://issues.apache.org/jira/browse/HIVE-1257

 

JVS

 

On Mar 18, 2010, at 9:50 AM, Daniel Pruckler wrote:





We're just starting with the HBaseIntegration, and are really excited about it, great work on this.  Thanks

 

We are running into a problem when joining two hbase tables.  It seems to usually work when joining one hbase table with one hive table, but not always.  Let me know if I'm doing something wrong.

 

My workaround for right now is to create new tables in hive and overwrite that table with the external hbase table, and then join from the new tables.  In our situation this is working as the one large table we've got seems to work just fine in hbase, we just need to overwrite the smaller join tables before doing a join.  I have one join example at the end of this email that fails when joining an hbase and hive table.

 

 

 

 

Here's what I did:

 

Create the HBase tables:

 

create 'users', 'info'

put 'users', 'user1', 'info:state', 'IA'

put 'users', 'user1', 'info:country', 'USA'

put 'users', 'user1', 'info:country_id', '0'

 

create 'states', 'state'

put 'states', 'IA', 'state:name', 'Iowa'

 

create 'countries', 'info'

put 'countries', 'USA', 'info:name', 'United States'

put 'countries', 'USA', 'info:country', 'USA'

put 'countries', 'USA', 'info:country_id', '1'

 

 

 

Create the hive tables:

 

CREATE EXTERNAL TABLE users(key string, state string, country string, country_id int)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES (

"hbase.key.type" = "string",

"hbase.columns.mapping" = "info:state,info:country,info:country_id",

"hbase.table.name" = "users"

);

 

CREATE EXTERNAL TABLE states(key string, name string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES (

"hbase.key.type" = "string",

"hbase.columns.mapping" = "state:name",

"hbase.table.name" = "states"

);

 

CREATE EXTERNAL TABLE countries(key string, name string, country string, country_id int)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES (

"hbase.key.type" = "string",

"hbase.columns.mapping" = "info:name,info:country,info:country_id",

"hbase.table.name" = "countries"

);

 

 

Then try to join, this returns no results:

SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country = c.key);

OK

Time taken: 26.75 seconds

 

 

This returns results, but not what's expected, it seems like it joins the user table with itself

SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country = c.country);

OK

user1     USA        NULL      user1

Time taken: 24.867 seconds

 

 

This also returns results (it should not as the country_ids do not match)

SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country_id = c.country_id);

OK

user1     USA        NULL      user1

Time taken: 22.868 seconds

 

 

 

 

I would guess this is part of the same issue, trying to join anything in state with user results in an error (state does not have an 'info' column family)

hive> SELECT u.key, u.state, s.name FROM users u JOIN states s ON (u.state = s.key);

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapred.reduce.tasks=<number>

Starting Job = job_201003150937_0096, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201003150937_0096

Kill Command = /usr/lib/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201003150937_0096

2010-03-18 10:43:28,730 Stage-1 map = 0%,  reduce = 0%

2010-03-18 10:43:34,908 Stage-1 map = 50%,  reduce = 0%

2010-03-18 10:43:44,020 Stage-1 map = 50%,  reduce = 17%

2010-03-18 10:44:02,146 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_201003150937_0096 with errors

 

Task with the most failures(4):

-----

Task ID:

  task_201003150937_0096_m_000001

 

URL:

  http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0096&tipid=task_201003150937_0096_m_000001

-----

 

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

 

org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family state does not exist in region users,,1268926579188 in table {NAME => 'users', FAMILIES => [{NAME => 'info', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

                at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2476)

                at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1107)

                at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1095)

                at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1846)

                at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)

                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

                at java.lang.reflect.Method.invoke(Method.java:597)

                at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)

                at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

 

                at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

                at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

                at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)

                at java.lang.reflect.Constructor.newInstance(Constructor.java:513)

                at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)

                at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1041)

                at org.apache.hadoop.hbase.client.HTable$ClientScanner.nextScanner(HTable.java:1935)

                at org.apache.hadoop.hbase.client.HTable$ClientScanner.initialize(HTable.java:1855)

                at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:376)

                at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.restart(TableInputFormatBase.java:129)

                at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.init(TableInputFormatBase.java:139)

                at org.apache.hadoop.hbase.mapred.TableInputFormatBase.getRecordReader(TableInputFormatBase.java:260)

                at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:119)

                at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229)

                at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)

                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

                at org.apache.hadoop.mapred.Child.main(Child.java:170)

 

 

 

Here's my workaround that allows me to be able to join.

 

CREATE TABLE hive_users(key string, state string, country string, country_id int);

INSERT OVERWRITE TABLE hive_users SELECT key, state, country, country_id FROM users;

 

CREATE TABLE hive_states(key string, name string);

INSERT OVERWRITE TABLE hive_states SELECT key, name FROM states;

 

CREATE TABLE hive_countries(key string, name string, country string, country_id int);

INSERT OVERWRITE TABLE hive_countries SELECT key, name, country, country_id FROM countries;

 

 

Any of my selects work just fine:

SELECT u.key, u.state, s.name FROM hive_users u JOIN hive_states s ON (u.state = s.key);

SELECT u.key, u.country, c.name FROM hive_users u JOIN hive_countries c ON (u.country = c.key);

 

 

 

When joining one hbase table it will work with some tables, and give errors with others, I'm not seeing anything I'm doing wrong with the select.

 

This works:

SELECT u.key, u.country, c.name FROM users u JOIN hive_countries c ON (u.country = c.key);

 

However while trying other things, I found that this query does not work, and cannot figure out why it'd be any different, maybe I'm doing something wrong.

SELECT u.key, u.country, c.name FROM hive_users u JOIN countries c ON (u.country = c.key);

 

Ended Job = job_201003150937_0118 with errors

 

Task with the most failures(4):

-----

Task ID:

  task_201003150937_0118_m_000001

 

URL:

  http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0118&tipid=task_201003150937_0118_m_000001

-----

 

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

 

 

java.lang.IllegalArgumentException: string cannot be null

                at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:390)

                at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:68)

                at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229)

                at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)

                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

                at org.apache.hadoop.mapred.Child.main(Child.java:170)

 

 

 

 

 

Thanks,

Daniel

Re: Problem when joining hbase tables in hive

Posted by John Sichi <js...@facebook.com>.

Hi Daniel,

Thank you so much for the very detailed test case.  It's a great example of excellent, easy-to-reproduce bug reporting.

The problem shows up for me too; I think it's due to an issue mentioned by Namit while he was reviewing the patch.  The reason it doesn't show up in the unit test is that the unit test populates identical data in the Hive table and the HBase table before exercising the join, whoops.

I've created this issue for tracking purposes:

https://issues.apache.org/jira/browse/HIVE-1257

JVS

On Mar 18, 2010, at 9:50 AM, Daniel Pruckler wrote:

We're just starting with the HBaseIntegration, and are really excited about it, great work on this.  Thanks

We are running into a problem when joining two hbase tables.  It seems to usually work when joining one hbase table with one hive table, but not always.  Let me know if I'm doing something wrong.

My workaround for right now is to create new tables in hive and overwrite that table with the external hbase table, and then join from the new tables.  In our situation this is working as the one large table we've got seems to work just fine in hbase, we just need to overwrite the smaller join tables before doing a join.  I have one join example at the end of this email that fails when joining an hbase and hive table.




Here's what I did:

Create the HBase tables:

create 'users', 'info'
put 'users', 'user1', 'info:state', 'IA'
put 'users', 'user1', 'info:country', 'USA'
put 'users', 'user1', 'info:country_id', '0'

create 'states', 'state'
put 'states', 'IA', 'state:name', 'Iowa'

create 'countries', 'info'
put 'countries', 'USA', 'info:name', 'United States'
put 'countries', 'USA', 'info:country', 'USA'
put 'countries', 'USA', 'info:country_id', '1'



Create the hive tables:

CREATE EXTERNAL TABLE users(key string, state string, country string, country_id int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.key.type" = "string",
"hbase.columns.mapping" = "info:state,info:country,info:country_id",
"hbase.table.name<http://hbase.table.name>" = "users"
);

CREATE EXTERNAL TABLE states(key string, name string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.key.type" = "string",
"hbase.columns.mapping" = "state:name",
"hbase.table.name<http://hbase.table.name>" = "states"
);

CREATE EXTERNAL TABLE countries(key string, name string, country string, country_id int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.key.type" = "string",
"hbase.columns.mapping" = "info:name,info:country,info:country_id",
"hbase.table.name<http://hbase.table.name>" = "countries"
);


Then try to join, this returns no results:
SELECT u.key, u.country, c.name<http://c.name>, c.key FROM users u JOIN countries c ON (u.country = c.key);
OK
Time taken: 26.75 seconds


This returns results, but not what's expected, it seems like it joins the user table with itself
SELECT u.key, u.country, c.name<http://c.name>, c.key FROM users u JOIN countries c ON (u.country = c.country);
OK
user1     USA        NULL      user1
Time taken: 24.867 seconds


This also returns results (it should not as the country_ids do not match)
SELECT u.key, u.country, c.name<http://c.name>, c.key FROM users u JOIN countries c ON (u.country_id = c.country_id);
OK
user1     USA        NULL      user1
Time taken: 22.868 seconds




I would guess this is part of the same issue, trying to join anything in state with user results in an error (state does not have an 'info' column family)
hive> SELECT u.key, u.state, s.name<http://s.name> FROM users u JOIN states s ON (u.state = s.key);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201003150937_0096, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201003150937_0096
Kill Command = /usr/lib/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201003150937_0096
2010-03-18 10:43:28,730 Stage-1 map = 0%,  reduce = 0%
2010-03-18 10:43:34,908 Stage-1 map = 50%,  reduce = 0%
2010-03-18 10:43:44,020 Stage-1 map = 50%,  reduce = 17%
2010-03-18 10:44:02,146 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201003150937_0096 with errors

Task with the most failures(4):
-----
Task ID:
  task_201003150937_0096_m_000001

URL:
  http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0096&tipid=task_201003150937_0096_m_000001
-----

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family state does not exist in region users,,1268926579188 in table {NAME => 'users', FAMILIES => [{NAME => 'info', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
                at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2476)
                at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1107)
                at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1095)
                at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1846)
                at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
                at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

                at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
                at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
                at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
                at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
                at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
                at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1041)
                at org.apache.hadoop.hbase.client.HTable$ClientScanner.nextScanner(HTable.java:1935)
                at org.apache.hadoop.hbase.client.HTable$ClientScanner.initialize(HTable.java:1855)
                at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:376)
                at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.restart(TableInputFormatBase.java:129)
                at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.init(TableInputFormatBase.java:139)
                at org.apache.hadoop.hbase.mapred.TableInputFormatBase.getRecordReader(TableInputFormatBase.java:260)
                at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:119)
                at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229)
                at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
                at org.apache.hadoop.mapred.Child.main(Child.java:170)



Here's my workaround that allows me to be able to join.

CREATE TABLE hive_users(key string, state string, country string, country_id int);
INSERT OVERWRITE TABLE hive_users SELECT key, state, country, country_id FROM users;

CREATE TABLE hive_states(key string, name string);
INSERT OVERWRITE TABLE hive_states SELECT key, name FROM states;

CREATE TABLE hive_countries(key string, name string, country string, country_id int);
INSERT OVERWRITE TABLE hive_countries SELECT key, name, country, country_id FROM countries;


Any of my selects work just fine:
SELECT u.key, u.state, s.name<http://s.name> FROM hive_users u JOIN hive_states s ON (u.state = s.key);
SELECT u.key, u.country, c.name<http://c.name> FROM hive_users u JOIN hive_countries c ON (u.country = c.key);



When joining one hbase table it will work with some tables, and give errors with others, I'm not seeing anything I'm doing wrong with the select.

This works:
SELECT u.key, u.country, c.name<http://c.name> FROM users u JOIN hive_countries c ON (u.country = c.key);

However while trying other things, I found that this query does not work, and cannot figure out why it'd be any different, maybe I'm doing something wrong.
SELECT u.key, u.country, c.name<http://c.name> FROM hive_users u JOIN countries c ON (u.country = c.key);

Ended Job = job_201003150937_0118 with errors

Task with the most failures(4):
-----
Task ID:
  task_201003150937_0118_m_000001

URL:
  http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0118&tipid=task_201003150937_0118_m_000001
-----

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver


java.lang.IllegalArgumentException: string cannot be null
                at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:390)
                at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:68)
                at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229)
                at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
                at org.apache.hadoop.mapred.Child.main(Child.java:170)





Thanks,
Daniel