Posted to user@hive.apache.org by Adam Laiacano <ad...@tumblr.com> on 2012/05/30 22:59:10 UTC

Hive inconsistently interpreting 'where' and 'group by'

Hi all,

I have an activity log stored in an external Hive table, LZO-compressed and partitioned by 'dt', which is the date the data was recorded. Because of time zones and when we dump the data into HDFS, each partition contains about 22 hours of one day and 2 hours of the following day. In the example below, about 95% of the rows fall on 2012-05-29 and the rest on 2012-05-30.

The first column is a timestamp ("ts", bigint). I want to count the number of actions on each day within a single partition with this query:

    SELECT TO_DATE(FROM_UNIXTIME(ts)) AS day, count(*) FROM
        activity_log
        WHERE
        dt = "2012-05-29"
        GROUP BY TO_DATE(FROM_UNIXTIME(ts))
    
    
The correct results are (I confirmed these by decompressing the files locally and counting with a simple python script):

    NULL 201
    2012-05-29 80677204
    2012-05-30 3826101
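
The verification script was essentially the following (a minimal sketch, assuming tab-separated rows with ts as the first field, and counting days in UTC; note that Hive's FROM_UNIXTIME uses the server's local time zone, so the actual day boundary may sit elsewhere):

```python
from collections import Counter
from datetime import datetime, timezone

def count_by_day(lines):
    """Tally rows per calendar day (UTC); the first tab-separated field is an epoch ts."""
    counts = Counter()
    for line in lines:
        field = line.split("\t", 1)[0]
        try:
            day = datetime.fromtimestamp(int(field), tz=timezone.utc).strftime("%Y-%m-%d")
        except (ValueError, OverflowError):
            day = "NULL"  # corrupt rows land here, matching the NULL bucket above
        counts[day] += 1
    return counts
```

Running this over the decompressed partition files gives the per-day counts shown above.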
    
The NULL is caused by a relatively tiny number of corrupt rows. I sometimes get this with the exact same query:

    NULL 201
    2012-05-29 84503305
    
In those runs it seems to fold all of the non-NULL timestamps into 2012-05-29. I'd estimate it returns the wrong result about 1 out of every 5 times.

I've tried other things like 

    SELECT COUNT(*) FROM activity_log
        WHERE dt = "2012-05-29" AND ts > 1338267600 AND ts < 1338354000

but it seems to have the same problem interpreting the ts values consistently.
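
For reference, those two literals are the midnight boundaries of 2012-05-29 at a fixed UTC-5 offset (an assumption on my part about how they were derived; US Eastern time in late May is actually UTC-4). A quick sketch that reproduces them:

```python
from datetime import datetime, timezone, timedelta

# Epoch-second boundaries for 2012-05-29, assuming a fixed UTC-5 offset;
# this reproduces the literals used in the query above.
tz = timezone(timedelta(hours=-5))
start = int(datetime(2012, 5, 29, tzinfo=tz).timestamp())
end = int(datetime(2012, 5, 30, tzinfo=tz).timestamp())
print(start, end)  # prints: 1338267600 1338354000
```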

Any tips or explanations would be greatly appreciated. I'm not sure if the corrupt rows have something to do with it, but I still don't get why that only happens sometimes. 

Thanks,
Adam