Posted to dev@hive.apache.org by "eye (JIRA)" <ji...@apache.org> on 2013/10/18 12:14:49 UTC
[jira] [Updated] (HIVE-5590) select and get duplicated records with
hive when a .deflate file greater than 64MB was loaded to a hive table
[ https://issues.apache.org/jira/browse/HIVE-5590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
eye updated HIVE-5590:
----------------------
Description:
We occasionally get compressed files larger than 160MB in .deflate format. One such file was loaded into Hive through an external table, say table T_A.
When we ran select count(*) from T_A, we got more records, 70% more, than the count we get when we check the same file with "hadoop fs -text /xxxxx |wc -l".
Any clue about this? How could it happen?
The large .deflate file was the result of imperfect processing. After we fixed that and got files smaller than 64MB, the problem did not occur. But since there is no guarantee that a larger file will not show up again, is there any way to avoid this problem?
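For a line count that does not go through Hadoop or Hive at all, something like the following could be run on a local copy of the file. This is only a minimal sketch: the function name is made up for illustration, and it assumes the .deflate file is a single zlib-wrapped stream (which is what Hadoop's default codec is understood to write) and that the file has already been copied out of HDFS.

```python
import zlib


def count_lines_in_deflate(path, chunk_size=1 << 20):
    """Count newline characters in a zlib-compressed file.

    Hypothetical helper, not part of Hive or Hadoop. Assumes the file
    is one zlib stream; data after the first stream is ignored.
    """
    decompressor = zlib.decompressobj()
    lines = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Decompress incrementally so a large file is never
            # held in memory all at once.
            lines += decompressor.decompress(chunk).count(b"\n")
    # Flush any data still buffered inside the decompressor.
    lines += decompressor.flush().count(b"\n")
    return lines
```

Like "wc -l", this counts newline characters, so its result should be comparable with the "hadoop fs -text ... |wc -l" number rather than with the count(*) returned by Hive.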
cheers!
eye
was:
We occasionally get compressed files larger than 160MB in .deflate format. One such file was loaded into Hive through an external table, say table T_A.
When we ran select count(*) from T_A, we got more records, 70% more, than the count we get when we check the same file with "hadoop fs -text /xxxxx |wc -l".
Any clue about this?
The large .deflate file was the result of imperfect processing. After we fixed that and got files smaller than 64MB, the problem did not occur. But since there is no guarantee that a larger file will not show up again, is there any way to avoid this problem?
cheers!
eye
> select and get duplicated records with hive when a .deflate file greater than 64MB was loaded to a hive table
> -------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-5590
> URL: https://issues.apache.org/jira/browse/HIVE-5590
> Project: Hive
> Issue Type: Bug
> Environment: cdh4
> Reporter: eye
> Labels: 64M, count(*), duplited, hdfs, hive, records
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> We occasionally get compressed files larger than 160MB in .deflate format. One such file was loaded into Hive through an external table, say table T_A.
> When we ran select count(*) from T_A, we got more records, 70% more, than the count we get when we check the same file with "hadoop fs -text /xxxxx |wc -l".
> Any clue about this? How could it happen?
> The large .deflate file was the result of imperfect processing. After we fixed that and got files smaller than 64MB, the problem did not occur. But since there is no guarantee that a larger file will not show up again, is there any way to avoid this problem?
> cheers!
> eye
--
This message was sent by Atlassian JIRA
(v6.1#6144)