Posted to dev@hive.apache.org by Attila Magyar <am...@hortonworks.com> on 2019/09/09 15:27:50 UTC

Review Request 71456: select count gives incorrect result after loading data from text file

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71456/
-----------------------------------------------------------

Review request for hive, Ashutosh Chauhan, Jesús Camacho Rodríguez, and Slim Bouguerra.


Bugs: HIVE-22055
    https://issues.apache.org/jira/browse/HIVE-22055


Repository: hive-git


Description
-------

This happens when tez.grouping.min-size is set to a small value (for example 1), so that the split size calculated from the file size is used. That size changes as the table grows, so a different split size is used for each select.

load 90 records from f1
select count(1) gives back 90
load 90 records from f2
select count(1) gives back 172 // 8 records missing


When the second select runs, the split size is larger, and SerDeLowLevelCacheImpl is already populated with stripes from the first select (when the split size was smaller).


There is a problem with how LineRecordReader works together with the cache. If a larger split is requested and an overlapping smaller one is already in the cache, SerDeEncodedDataReader tries to extend the existing split by reading the difference between the large and the small split. It starts reading right after the point where the last stripe physically ends, but LineRecordReader always skips the first row unless it is at the beginning of the file. This line-skipping behaviour is not taken into account at that point, which is why some rows are missing.
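The effect can be reproduced with a small standalone model. The Python sketch below is not Hive or Hadoop code: read_split, count_with_cache and the compensate flag are hypothetical names, and the byte offsets are made up for illustration. It only assumes the two behaviours described above: a reader that starts anywhere other than offset 0 discards its first line, and the cache resumes reading exactly where the cached data physically ends. The compensate branch shows one possible way to account for the skipped line and is not necessarily what the patch does.

```python
# Simplified model of line-oriented split reading; not Hive or Hadoop code.

def _next_line_start(data: bytes, pos: int) -> int:
    """Offset just past the newline that terminates the line containing pos."""
    nl = data.find(b"\n", pos)
    return len(data) if nl == -1 else nl + 1


def read_split(data: bytes, start: int, end: int):
    """Read the rows of byte range [start, end) the way a line reader would.

    Returns (rows, pos), where pos is the offset right after the last row read.
    """
    pos = start
    if start != 0:
        # Not at the beginning of the file: discard the first line, on the
        # assumption that the reader of the previous split already consumed it.
        pos = _next_line_start(data, pos)
    rows = []
    # A row belongs to this split if it starts at or before the split end.
    while pos <= end and pos < len(data):
        nxt = _next_line_start(data, pos)
        rows.append(data[pos:nxt].rstrip(b"\n"))
        pos = nxt
    return rows, pos


def count_with_cache(data: bytes, small_end: int, large_end: int,
                     compensate: bool) -> int:
    """Count rows when a cached small split is extended to a larger one."""
    # First select: the small split is read and its rows end up in the cache.
    cached_rows, cached_end = read_split(data, 0, small_end)
    # Second select: only the difference is read. Resuming exactly at
    # cached_end makes the reader skip one complete, uncached row.
    # Hypothetical compensation: resume one byte earlier, so the line that
    # gets skipped is just the newline of the last cached row.
    resume = max(cached_end - 1, 0) if compensate else cached_end
    extra_rows, _ = read_split(data, resume, large_end)
    return len(cached_rows) + len(extra_rows)


DATA = b"".join(b"row%d\n" % i for i in range(10))  # 10 rows, 5 bytes each

if __name__ == "__main__":
    print(len(read_split(DATA, 0, len(DATA))[0]))                # 10
    print(count_with_cache(DATA, 22, len(DATA), compensate=False))  # 9
    print(count_with_cache(DATA, 22, len(DATA), compensate=True))   # 10
```

In this model a single cached range loses exactly one row when it is extended.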


Diffs
-----

  itests/src/test/resources/testconfiguration.properties 98280c52fe9 
  llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java 462b25fa234 
  ql/src/test/queries/clientpositive/mm_loaddata_split_change.q PRE-CREATION 
  ql/src/test/results/clientpositive/llap/mm_loaddata_split_change.q.out PRE-CREATION 


Diff: https://reviews.apache.org/r/71456/diff/1/


Testing
-------

Tested with a new q test (mm_loaddata_split_change.q).


Thanks,

Attila Magyar