Posted to issues@hive.apache.org by "yangfang (JIRA)" <ji...@apache.org> on 2016/01/15 07:52:39 UTC

[jira] [Commented] (HIVE-12877) Hive use index for queries will lose some data if the Query file is compressed.

    [ https://issues.apache.org/jira/browse/HIVE-12877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101328#comment-15101328 ] 

yangfang commented on HIVE-12877:
---------------------------------

I create a partitioned table and load .gz files into the partition:
CREATE EXTERNAL TABLE IF NOT EXISTS if_pmt_note_staging (
apsdactno string, date_tr string, apsdjrnno string, apsdseqno string, province_code string
) partitioned by (batch_id string);

alter table if_pmt_note_staging add partition (batch_id='201510') location '/hive/if_pmt_note_staging';

The location '/hive/if_pmt_note_staging' contains several .gz files, such as 1.gz, 2.gz, and so on.

Then I create the index:

CREATE INDEX index_if_pmt_note_staging_date_tr 
ON TABLE if_pmt_note_staging (date_tr) 
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
WITH DEFERRED REBUILD 
IN TABLE t_index_if_pmt_note_staging_date_tr; 

alter index index_if_pmt_note_staging_date_tr on if_pmt_note_staging rebuild; 

CREATE INDEX index_apsh_province_code_apsdprocod_apsdactno_tr
ON TABLE apsh (apsdprocod) 
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
WITH DEFERRED REBUILD
IN TABLE t_index_province_code_apsdprocod_apsdactno_tr;

When I execute a query that uses the index on date_tr:
select * from if_pmt_note_staging where date_tr='20121205';

I found that some rows that should have been returned were missing from the result, for example the matching rows in the 3.gz file.

The Hive logs print the following:
split start : 10336
split end : 59916
...................

The hiveIndexResult.contains function does filter out some files in HiveIndexedInputFormat. The function is listed below:

  public boolean contains(FileSplit split) throws HiveException {

    ....................
    // The offsets come from the index; the start and length come from the input split.
    for (Long offset : bucket.getOffsets()) {
      if ((offset >= split.getStart())
          && (offset <= split.getStart() + split.getLength())) {
        return true;
      }
    }
    return false;
  }
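The offsets stored in the index are positions in the file after decompression, but split.getLength() is the length of the file before decompression. An offset can therefore be larger than split.getStart() + split.getLength(), so files that do contain matching rows may be filtered out by this function.
This section of code does not seem to be necessary, so I think we can delete it.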
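
To illustrate the mismatch, here is a minimal standalone sketch of the same range test. It does not use the Hive or Hadoop classes: the class name SplitOffsetCheck, the helper containsAny, and all the numbers are hypothetical, chosen only to show how an offset measured in decompressed data can fall outside a split measured in compressed bytes.

```java
import java.util.Arrays;
import java.util.List;

public class SplitOffsetCheck {

    // Same range test as the loop in contains(), with plain longs
    // standing in for FileSplit.getStart() and FileSplit.getLength().
    static boolean containsAny(List<Long> offsets, long splitStart, long splitLength) {
        for (long offset : offsets) {
            if (offset >= splitStart && offset <= splitStart + splitLength) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: a .gz file of 50,000 bytes on disk
        // that expands to 400,000 bytes when decompressed.
        long compressedLength = 50_000L;
        long decompressedLength = 400_000L;

        // Hypothetical index entry: a matching row at byte 250,000
        // of the decompressed data.
        List<Long> indexOffsets = Arrays.asList(250_000L);

        // Split bounds taken from the compressed file: the row is dropped.
        System.out.println(containsAny(indexOffsets, 0L, compressedLength));   // prints "false"

        // Bounds in the same units as the offsets: the row is kept.
        System.out.println(containsAny(indexOffsets, 0L, decompressedLength)); // prints "true"
    }
}
```

The first call mirrors what happens with a compressed file: the comparison mixes two different units, so the split is rejected even though it holds a matching row.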

> Hive use index for queries will lose some data if the Query file is compressed.
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-12877
>                 URL: https://issues.apache.org/jira/browse/HIVE-12877
>             Project: Hive
>          Issue Type: Bug
>          Components: Indexing
>    Affects Versions: 1.2.1
>         Environment: This problem exists in all Hive versions, regardless of platform
>            Reporter: yangfang
>
> Hive creates the index using the decompressed (extracted) file length when the file is compressed,
> but when MapReduce divides the data into splits, Hive compares the on-disk file length with the decompressed file length.
> If it finds that these two lengths do not match, it filters out the file, so the query loses some data.


