Posted to user@hive.apache.org by Lars Francke <la...@gmail.com> on 2012/06/21 17:42:55 UTC

Problem with NULLs in HBase "leaking" into following rows

Hi,

We're using the HBase integration in Hive 0.9 and are running into
problems with rows that contain NULL values (which map to
non-existent cells in HBase).

We're using a UDF[1] but see the same behavior without it.

As an example, we have a table with just two rows.

In HBase Shell:

create 'hive_hbase_test', 'test'
put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
put 'hive_hbase_test', '2', 'test:c1', 'c1-2'

In Hive:

DROP TABLE IF EXISTS hive_hbase_test;
CREATE EXTERNAL TABLE hive_hbase_test (
  id int,
  c1 string,
  c2 string,
  c3 string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key#s,test:c1#s,test:c2#s,test:c3#s")
TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");

hive> select * from hive_hbase_test;
OK
1	c1-1	c2-1	c3-1
2	c1-2	NULL	NULL

hive> select c1 from hive_hbase_test;
c1-1
c1-2

hive> select c1, c2 from hive_hbase_test;
c1-1	c2-1
c1-2	NULL

So far everything is correct but now:

hive> select c1, c2, c2 from hive_hbase_test;
c1-1	c2-1	c2-1
c1-2	NULL	c2-1

When c2 is selected twice, the first occurrence is correctly NULL in
the second row, but the second occurrence returns the value from the
previous row.
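
For readers trying to picture how output like the above can arise, here
is a minimal sketch of one mechanism that reproduces it: a row object
that is reused across rows and caches parsed field values. It is purely
a hypothetical model. The names LazyRow, load, get and select are
invented for illustration, and nothing here is taken from the Hive
HBase handler code, so treat it as an assumption about the kind of bug,
not a diagnosis.

```python
class LazyRow:
    """Hypothetical reused row object that lazily caches field values."""

    def __init__(self):
        self.cells = {}      # raw cells of the current row
        self.cached = {}     # parsed values; NOT cleared between rows (the assumed bug)
        self.inited = set()  # columns already looked up for the current row

    def load(self, cells):
        # Point the reused object at a new row; only the "inited" flags are reset.
        self.cells = cells
        self.inited.clear()

    def get(self, col):
        if col not in self.inited:
            self.inited.add(col)
            if col not in self.cells:
                # Missing cell: the first access correctly yields NULL,
                # but the stale cached value from an earlier row is left in place.
                return None
            self.cached[col] = self.cells[col]
        # Any repeated access trusts the cache, stale or not.
        return self.cached.get(col)


def select(rows, projection):
    """Evaluate a projection over all rows using a single reused LazyRow."""
    row = LazyRow()
    result = []
    for cells in rows:
        row.load(cells)
        result.append([row.get(col) for col in projection])
    return result


# Same data as the HBase example: row 2 has no c2 or c3 cell.
rows = [
    {"c1": "c1-1", "c2": "c2-1", "c3": "c3-1"},
    {"c1": "c1-2"},
]

for line in select(rows, ["c1", "c2", "c2"]):
    print(line)
# ['c1-1', 'c2-1', 'c2-1']
# ['c1-2', None, 'c2-1']
```

In this model, selecting a column once never shows the problem, because
the first lookup of a missing cell short-circuits to NULL; only a
repeated reference to the same column reads the stale cache.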

hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
c1-2	NULL	NULL	c2-1	c3-1	c3-1	c1-2

The same queries return correct results with a "native" HDFS-backed table.

In our UDF we added logging (this UDF gets a year, month and
day, any of which might be null) and tested it on a simple two-row table.

hive> SELECT id, year, month, parseDate(year, month, day) FROM
naughty_occurrence;

First row (data in HBase, 1997-1-1):
deferred: [1997] - convertedObject: [1997]
deferred: [1] - convertedObject: [1]
deferred: [1] - convertedObject: [1]
Year: [1997], Month: [1], Day: [1]

Second row (data in HBase: 2006-null-null):
deferred: [2006] - convertedObject: [2006]
deferred: [1] - convertedObject: [1]
deferred: [1] - convertedObject: [null]
Year: [2006], Month: [1], Day: [null]

I know this looks very confusing and I hope I haven't overdone it with
the examples, but this seems like a rather serious problem with the
HBase integration. Values from previous rows are "leaking" into null
values in following rows. We're not 100% sure whether we're doing
something wrong, but I don't see what it could be. I'll open an issue
if no one has an idea what's going on here. I tried looking at the HBase
handler code but was confused by it. I will try again tomorrow.

Thanks for bearing with me.

Cheers,
Lars

[1] I would very much appreciate a review of our usage of
DeferredObjects etc.:
<https://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/hive/udf/DateParsingUDF.java>

Re: Problem with NULLs in HBase "leaking" into following rows

Posted by Lars Francke <la...@gmail.com>.
We've figured out that this is indeed a bug, opened
https://issues.apache.org/jira/browse/HIVE-3179 for it, and will
provide a fix.

Cheers,
Lars