Posted to dev@hive.apache.org by "Szehon Ho (JIRA)" <ji...@apache.org> on 2013/12/20 02:02:33 UTC

[jira] [Updated] (HIVE-5774) INSERT OVERWRITE DYNAMIC PARTITION on LARGE DATA

     [ https://issues.apache.org/jira/browse/HIVE-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szehon Ho updated HIVE-5774:
----------------------------

    Assignee:     (was: Szehon Ho)

> INSERT OVERWRITE DYNAMIC PARTITION on LARGE DATA
> ------------------------------------------------
>
>                 Key: HIVE-5774
>                 URL: https://issues.apache.org/jira/browse/HIVE-5774
>             Project: Hive
>          Issue Type: Bug
>          Components: Database/Schema
>         Environment: debian 6.0.7
>            Reporter: Danny Teok
>            Priority: Critical
>              Labels: dynamic, hive, insert, overwrite, partition
>
> After several rounds of forensic analysis, we are convinced that there is a bug when rebuilding with dynamic partitioning over more than 30 days: row counts do not match.
> In detail:
> Part A -- original_table
> 2013-01-01; 394,755 rows
> 2013-01-02; 424,448
> 2013-01-03; 427,201
> ...
> 2013-10-30; 3,234,472
> Part B -- copy_of_original_table_new
> 2013-01-01; 372,628 rows
> 2013-01-02; 400,553
> 2013-01-03; 403,495
> ...
> 2013-10-30; 2,865,877
> The query used to populate the original table is the same one used to populate the "copy_of_original_table_new" table. When we rebuilt for 1 day, e.g. 2013-01-01, the row count of copy_of_original_table_new matched original_table exactly.
> When we rebuilt for 7 days, the row counts matched exactly.
> When we rebuilt for 15 days, the row counts matched exactly.
> When we rebuilt for 303 days (10 months), everything broke: no row counts matched.
> When we rebuilt for 35 days, 80% of the dates matched exactly. The other 20% were out by hundreds to tens of thousands of rows (a variance of up to 3%).
> In other words, the more days that are covered by the WHERE dt BETWEEN dateStart AND dateEnd clause, the more dates are out, i.e. their row counts do not match original_table.
> However, we then rebuilt each of the dates in that 20% statically, using the corresponding date. The result was surprising: they matched the original_table row count!
> Apologies in advance if this is not technical enough, but I hope the message is clear. We believe there is a bug. We are not sure how to check our Hive version, but our Hadoop version is "Hadoop 2.0.0-cdh4.1.1".
> The INSERT OVERWRITE SQL can be seen here -- http://pastebin.com/g1qxsUm2
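
The per-date comparison described in the report can be sketched as follows. This is only an illustration, not the reporter's actual tooling: it assumes the per-partition counts were already collected (for example with something like SELECT dt, COUNT(*) ... GROUP BY dt on each table) into plain dictionaries, and the function name compare_partition_counts is hypothetical. The numbers are the ones quoted in Part A and Part B above.

```python
from typing import Dict, List, Tuple


def compare_partition_counts(
    original: Dict[str, int], rebuilt: Dict[str, int]
) -> List[Tuple[str, int, int, float]]:
    """Return (partition, original_count, rebuilt_count, pct_diff)
    for every partition whose row counts differ.

    Hypothetical helper: a partition missing from one side counts as 0 rows.
    """
    mismatches = []
    for dt in sorted(set(original) | set(rebuilt)):
        a = original.get(dt, 0)
        b = rebuilt.get(dt, 0)
        if a != b:
            # Percentage difference relative to the original table's count.
            pct = abs(a - b) / a * 100 if a else float("inf")
            mismatches.append((dt, a, b, pct))
    return mismatches


# Row counts quoted in the report (Part A and Part B).
original_table = {
    "2013-01-01": 394755,
    "2013-01-02": 424448,
    "2013-01-03": 427201,
    "2013-10-30": 3234472,
}
copy_of_original_table_new = {
    "2013-01-01": 372628,
    "2013-01-02": 400553,
    "2013-01-03": 403495,
    "2013-10-30": 2865877,
}

mismatches = compare_partition_counts(original_table, copy_of_original_table_new)
for dt, a, b, pct in mismatches:
    print(f"{dt}: original={a} rebuilt={b} missing={a - b} ({pct:.2f}%)")
```

Running the same comparison after each rebuild window (1, 7, 15, 35, 303 days) would make it easy to see at which window size the partitions start to diverge.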



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)