Posted to dev@hive.apache.org by Xing Wu <xi...@outlook.com> on 2013/08/06 00:24:33 UTC

Inconsistent results with and without index. Is this a bug?

Hive Dev Team,


Greetings!

We have encountered an issue when using Hive 0.8.1.8 and Hive 0.11.0. After some investigation, we think this looks like a bug in Hive, so I'm sending this email to report the issue and to confirm it with you. Please let me know if this is not the correct mailing list for this kind of topic.

The issue we hit is related to indexed queries on external tables stored as SequenceFile. For example, consider a simple table like the one created below,

CREATE TABLE hive_test
(
id int,
name string,
info string
)
STORED AS SEQUENCEFILE; 

We first insert 5000 rows, all with the same id (e.g., id = 1), into this table. We then count the total number of rows in the table by running the query below and get the correct result, 5000.

select count(*) from hive_test where id = 1;
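The exact load statement is not part of the steps above; since INSERT ... VALUES is not available in these Hive versions, one possible way to produce the 5000 rows is an INSERT ... SELECT from a staging table (source_table below is a hypothetical table that already holds 5000 rows with name and info columns),

-- hypothetical load step; source_table is a placeholder staging table
INSERT INTO TABLE hive_test
SELECT 1, name, info FROM source_table;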

After this, we create an index on id,

CREATE INDEX test_index ON TABLE hive_test(id) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
ALTER INDEX test_index ON hive_test REBUILD;
set hive.optimize.index.filter=true;
set hive.optimize.index.filter.compact.minsize=0;

Then, we run the same query 'select count(*) from hive_test where id = 1;' again but get a different result (count > 5000). 

We dug into the Hive source code and found the following piece of code in HiveIndexedInputFormat.java, which might be the root cause of the duplicated rows,

if (split.inputFormatClassName().contains("RCFile") || split.inputFormatClassName().contains("SequenceFile")) {
    if (split.getStart() > SequenceFile.SYNC_INTERVAL) {
        newSplit = new HiveInputSplit(new FileSplit(split.getPath(),
                split.getStart() - SequenceFile.SYNC_INTERVAL,
                split.getLength() + SequenceFile.SYNC_INTERVAL,
                split.getLocations()),
                split.inputFormatClassName());
    }
}

According to my understanding of SequenceFile and SequenceFileRecordReader, it is unnecessary and incorrect to extend each input split backwards by the extra 2000 bytes (SequenceFile.SYNC_INTERVAL): the record reader already syncs forward from the split start to the next sync marker before reading, so the padding only makes adjacent splits overlap, and the rows in the overlapping regions are processed by two mappers and counted twice. Please correct me if I'm wrong.
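To illustrate the effect outside of Hive, below is a minimal standalone sketch (not Hive source) that compares where a SequenceFile reader actually starts reading for a split with and without the backwards padding. The file path (/tmp/hive_test.seq), the split offset, and the class name are hypothetical placeholders, not values from our cluster,

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SyncPaddingDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path file = new Path("/tmp/hive_test.seq");   // hypothetical SequenceFile with the table's data
        long splitStart = 64L * 1024;                 // hypothetical start offset of the second split

        // Where a record reader really begins: SequenceFileRecordReader syncs forward
        // from the split start to the next sync marker before reading any records.
        SequenceFile.Reader plain = new SequenceFile.Reader(fs, file, conf);
        plain.sync(splitStart);
        long normalStart = plain.getPosition();
        plain.close();

        // Where it begins after HiveIndexedInputFormat moves the start back by SYNC_INTERVAL.
        SequenceFile.Reader padded = new SequenceFile.Reader(fs, file, conf);
        padded.sync(splitStart - SequenceFile.SYNC_INTERVAL);
        long paddedStart = padded.getPosition();
        padded.close();

        // If a sync marker falls inside the 2000-byte padding window, paddedStart is
        // smaller than normalStart, and every record between the two positions is also
        // read by the previous split (whose end offset is unchanged), i.e. counted twice.
        System.out.println("normal start = " + normalStart + ", padded start = " + paddedStart);
    }
}

Whenever paddedStart comes out smaller than normalStart, the padded split begins in a region the previous split's reader has already consumed, which would explain the inflated count.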


Thank you,
Xing