Posted to dev@hive.apache.org by "Jan Van Besien (JIRA)" <ji...@apache.org> on 2013/12/03 15:25:35 UTC

[jira] [Updated] (HIVE-5927) wrong start/stop key on hbase scan with inner join and where clause on id

     [ https://issues.apache.org/jira/browse/HIVE-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Van Besien updated HIVE-5927:
---------------------------------

    Description: 
Given two HBase tables (the HBase shell commands below create and populate them):

{code}
create 'tablea', 'data'
create 'tableb', 'data'
put 'tablea', 'a', 'data:linkb', 'b'
put 'tableb', 'b', 'data:linka', 'a'
{code}

And given two corresponding Hive table definitions:

{code}
create external table tablea(rowkey string, linkb string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ('hbase.columns.mapping' = ':key,data:linkb') tblproperties ('hbase.table.name'='tablea');
create external table tableb(rowkey string, linka string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ('hbase.columns.mapping' = ':key,data:linka') tblproperties ('hbase.table.name'='tableb');
{code}

These two queries return no results, although each should return a single row:

{code}
select * from tablea join tableb on tablea.linkb = tableb.rowkey where tablea.linkb = 'b';
select * from tablea join tableb on tablea.linkb = tableb.rowkey where tableb.rowkey = 'b';
{code}

For reference, this works:

{code}
select * from tablea join tableb on tablea.linkb = tableb.rowkey;
+---------+--------+---------+--------+
| rowkey  | linkb  | rowkey  | linka  |
+---------+--------+---------+--------+
| a       | b      | b       | a      |
+---------+--------+---------+--------+
{code}

I think the problem lies in how the HBaseStorageHandler builds scans. Each failing query results in two scans, one per table, but both scans appear to be configured with start and stop row keys of 'b'. That is correct only for the scan over tableb, whose row key is 'b'; the matching row in tablea has row key 'a', so the scan over tablea finds nothing and the join comes back empty. The cause seems to be TableScanDesc.FILTER_EXPR_CONF_STR in the job configuration, which is set to the same value for both scans.
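
The suspected mechanism can be illustrated with a small toy model. This is only a sketch of the hypothesis above, not Hive or HBase code: `FILTER_KEY`, `scan`, `join`, and the per-table key suffix are all hypothetical names, the tables are plain dictionaries, and the row-key range is simplified to an exact match. It shows that a single shared filter value restricts both scans to row key 'b', and that scoping the value to tableb (one conceivable remedy, not necessarily the one Hive would adopt) yields the expected row.

```python
# Toy model only: none of these names are Hive or HBase APIs.
FILTER_KEY = "filter.expr"  # stands in for TableScanDesc.FILTER_EXPR_CONF_STR

# In-memory copies of the two HBase tables from the report.
TABLES = {
    "tablea": {"a": {"linkb": "b"}},
    "tableb": {"b": {"linka": "a"}},
}


def scan(conf, table):
    """Return the rows of `table` selected by the row-key filter in `conf`.

    A table-scoped entry (hypothetical) wins over the shared entry.
    No entry at all means an unbounded scan.
    """
    key = conf.get(f"{FILTER_KEY}.{table}", conf.get(FILTER_KEY))
    rows = TABLES[table]
    if key is None:
        return dict(rows)
    # Start row == stop row == key, modelled here as an exact match.
    return {k: v for k, v in rows.items() if k == key}


def join(conf):
    """Inner join on tablea.linkb = tableb.rowkey over the two scans."""
    left = scan(conf, "tablea")
    right = scan(conf, "tableb")
    return [
        (ka, va["linkb"], kb, vb["linka"])
        for ka, va in left.items()
        for kb, vb in right.items()
        if va["linkb"] == kb
    ]


shared = {FILTER_KEY: "b"}              # same value seen by both scans
scoped = {f"{FILTER_KEY}.tableb": "b"}  # value applies to tableb only

print(join(shared))  # [] because the tablea scan is limited to row key 'b'
print(join(scoped))  # [('a', 'b', 'b', 'a')]
```

With the shared value, the scan over tablea never sees row 'a', which matches the empty result reported for the failing queries.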

  was:
Given two hbase tables (hbase shell commands shown to create and populate them):

create 'tablea', 'data'
create 'tableb', 'data'
put 'tablea', 'a', 'data:linkb', 'b'
put 'tableb', 'b', 'data:linka', 'a'

And given two corresponding hive table definitions:

create external table tablea(rowkey string, linkb string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ('hbase.columns.mapping' = ':key,data:linkb') tblproperties ('hbase.table.name'='tablea');
create external table tableb(rowkey string, linka string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ('hbase.columns.mapping' = ':key,data:linka') tblproperties ('hbase.table.name'='tableb');

These two queries return no results while they should return a single result:

select * from tablea join tableb on tablea.linkb = tableb.rowkey where tablea.linkb = 'b';
select * from tablea join tableb on tablea.linkb = tableb.rowkey where tableb.rowkey = 'b';

For reference, this works:

select * from tablea join tableb on tablea.linkb = tableb.rowkey;
+---------+--------+---------+--------+
| rowkey  | linkb  | rowkey  | linka  |
+---------+--------+---------+--------+
| a       | b      | b       | a      |
+---------+--------+---------+--------+


I think the problem is related to how the HBaseStorageHandler builds scans. The failing queries result in two scans, one for each table. However, both scans seem to be configured with start and stop row keys = 'b'. This is only correct for the scan over tableb. This seems to be caused by the TableScanDesc.FILTER_EXPR_CONF_STR on the job configuration, which is set to the same value for both cases.


> wrong start/stop key on hbase scan with inner join and where clause on id
> -------------------------------------------------------------------------
>
>                 Key: HIVE-5927
>                 URL: https://issues.apache.org/jira/browse/HIVE-5927
>             Project: Hive
>          Issue Type: Bug
>          Components: HBase Handler
>    Affects Versions: 0.10.0
>            Reporter: Jan Van Besien
>
> Given two HBase tables (the HBase shell commands below create and populate them):
> {code}
> create 'tablea', 'data'
> create 'tableb', 'data'
> put 'tablea', 'a', 'data:linkb', 'b'
> put 'tableb', 'b', 'data:linka', 'a'
> {code}
> And given two corresponding Hive table definitions:
> {code}
> create external table tablea(rowkey string, linkb string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ('hbase.columns.mapping' = ':key,data:linkb') tblproperties ('hbase.table.name'='tablea');
> create external table tableb(rowkey string, linka string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ('hbase.columns.mapping' = ':key,data:linka') tblproperties ('hbase.table.name'='tableb');
> {code}
> These two queries return no results, although each should return a single row:
> {code}
> select * from tablea join tableb on tablea.linkb = tableb.rowkey where tablea.linkb = 'b';
> select * from tablea join tableb on tablea.linkb = tableb.rowkey where tableb.rowkey = 'b';
> {code}
> For reference, this works:
> {code}
> select * from tablea join tableb on tablea.linkb = tableb.rowkey;
> +---------+--------+---------+--------+
> | rowkey  | linkb  | rowkey  | linka  |
> +---------+--------+---------+--------+
> | a       | b      | b       | a      |
> +---------+--------+---------+--------+
> {code}
> I think the problem lies in how the HBaseStorageHandler builds scans. Each failing query results in two scans, one per table, but both scans appear to be configured with start and stop row keys of 'b'. That is correct only for the scan over tableb, whose row key is 'b'; the matching row in tablea has row key 'a', so the scan over tablea finds nothing and the join comes back empty. The cause seems to be TableScanDesc.FILTER_EXPR_CONF_STR in the job configuration, which is set to the same value for both scans.



--
This message was sent by Atlassian JIRA
(v6.1#6144)