You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by David Engel <da...@istwok.net> on 2013/12/06 23:38:49 UTC

MIN/MAX issue with timestamps and RCFILE/ORC tables

Hi,

Because of the known, and believed fixed, issue with MIN/MAX
(HIVE-4931), we're using a recent (2013-12-02), locally built version
of Hive 0.13.0-SNAPSHOT.  Unfortunately, we're still seeing issues
using MIN/MAX on timestamp types when using RCFILE and ORC formatted
tables.  I could not find a reference to this problem in the Hive
JIRA, but I'm posting here first before opening a new report in JIRA.

Here's an example of what we're seeing.

The dnstext table is a very simplified variation of part of our
application and has the following definition.

3: jdbc:hive2://host:10000> describe dnstext ;
+-----------------------+-----------------------+-----------------------+
|       col_name        |       data_type       |        comment        |
+-----------------------+-----------------------+-----------------------+
| circuitid             | string                | None                  |
| rrname                | string                | None                  |
| srcaddr               | string                | None                  |
| dstaddr               | string                | None                  |
| occurs                | bigint                | None                  |
| firstseen             | timestamp             | None                  |
| lastseen              | timestamp             | None                  |
| histo                 | bigint                | None                  |
+-----------------------+-----------------------+-----------------------+

It is stored in TEXTFILE format.  There are also dnsrc and dnsorc
tables with the same column definition and data, but are stored in
RCFILE and ORC formats, respectively.

The following queries, one that aggregates the entire table and
another taht aggregates with a GROUP BY clause, work as expected on
the dnstext table.  The second query is wrapped in an outer query to
reduce the final number of rows for this example.

3: jdbc:hive2://host:10000> select count(1), sum(occurs), min(firstseen), max(lastseen) from dnstext ;
+-----------+------------+--------------------------+--------------------------+
|    _c0    |    _c1     |           _c2            |           _c3            |
+-----------+------------+--------------------------+--------------------------+
| 67110469  | 171344130  | 2013-09-05 23:59:28.977  | 2013-09-07 00:13:12.765  |
+-----------+------------+--------------------------+--------------------------+

3: jdbc:hive2://host:10000> select rrname, count, occurs, firstseen, lastseen from ( select rrname, count(1) count, sum(occurs) occurs, min(firstseen) firstseen, max(lastseen) lastseen from dnstext group by rrname ) t where rrname like 'spread%.google.com' order by rrname ;
+----------------------------+--------+---------+--------------------------+--------------------------+
|           rrname           | count  | occurs  |        firstseen         |         lastseen         |
+----------------------------+--------+---------+--------------------------+--------------------------+
| spreadsheets.google.com    | 217    | 228     | 2013-09-06 01:00:50.653  | 2013-09-07 00:03:49.82   |
| spreadsheets.l.google.com  | 263    | 272     | 2013-09-06 00:08:20.442  | 2013-09-07 00:08:57.891  |
| spreadsheets0.google.com   | 8      | 8       | 2013-09-06 00:47:01.475  | 2013-09-06 21:19:50.516  |
| spreadsheets2.google.com   | 7      | 9       | 2013-09-06 02:23:41.724  | 2013-09-06 23:12:26.698  |
+----------------------------+--------+---------+--------------------------+--------------------------+

When the same queries are run on the dnsrc table, the first query
returns the correct results, but the second returns incorrect results
for either the min(firstseen) or max(lastseen) columns on every row.

3: jdbc:hive2://host:10000> select count(1), sum(occurs), min(firstseen), max(lastseen) from dnsrc ;  
+-----------+------------+--------------------------+--------------------------+
|    _c0    |    _c1     |           _c2            |           _c3            |
+-----------+------------+--------------------------+--------------------------+
| 67110469  | 171344130  | 2013-09-05 23:59:28.977  | 2013-09-07 00:13:12.765  |
+-----------+------------+--------------------------+--------------------------+

3: jdbc:hive2://host:10000> select rrname, count, occurs, firstseen, lastseen from ( select rrname, count(1) count, sum(occurs) occurs, min(firstseen) firstseen, max(lastseen) lastseen from dnsrc group by rrname ) t where rrname like 'spread%.google.com' order by rrname ;  
+----------------------------+--------+---------+--------------------------+--------------------------+
|           rrname           | count  | occurs  |        firstseen         |         lastseen         |
+----------------------------+--------+---------+--------------------------+--------------------------+
| spreadsheets.google.com    | 217    | 228     | 2013-09-06 01:00:50.653  | 2013-09-07 00:08:57.891  |
| spreadsheets.l.google.com  | 263    | 272     | 2013-09-06 00:47:01.475  | 2013-09-07 00:08:57.891  |
| spreadsheets0.google.com   | 8      | 8       | 2013-09-06 00:47:01.475  | 2013-09-06 23:12:26.745  |
| spreadsheets2.google.com   | 7      | 9       | 2013-09-06 01:00:50.653  | 2013-09-06 23:12:26.745  |
+----------------------------+--------+---------+--------------------------+--------------------------+

When the same queries are run on the dnsorc table, the both queries
return incorrect results for the min(firstseen) or max(lastseen)
columns on every row.

3: jdbc:hive2://host:10000> select count(1), sum(occurs), min(firstseen), max(lastseen) from dnsorc ;
+-----------+------------+--------------------------+--------------------------+
|    _c0    |    _c1     |           _c2            |           _c3            |
+-----------+------------+--------------------------+--------------------------+
| 67110469  | 171344130  | 2013-09-06 03:19:42.726  | 2013-09-07 00:12:49.449  |
+-----------+------------+--------------------------+--------------------------+

3: jdbc:hive2://host:10000> select rrname, count, occurs, firstseen, lastseen from ( select rrname, count(1) count, sum(occurs) occurs, min(firstseen) firstseen, max(lastseen) lastseen from dnsorc group by rrname ) t where rrname like 'spread%.google.com' order by rrname ;
+----------------------------+--------+---------+--------------------------+--------------------------+
|           rrname           | count  | occurs  |        firstseen         |         lastseen         |
+----------------------------+--------+---------+--------------------------+--------------------------+
| spreadsheets.google.com    | 217    | 228     | 2013-09-06 03:19:42.726  | 2013-09-07 00:12:49.449  |
| spreadsheets.l.google.com  | 263    | 272     | 2013-09-06 03:19:42.726  | 2013-09-07 00:12:49.449  |
| spreadsheets0.google.com   | 8      | 8       | 2013-09-06 03:19:42.726  | 2013-09-06 21:01:07.743  |
| spreadsheets2.google.com   | 7      | 9       | 2013-09-06 03:19:42.726  | 2013-09-06 13:13:19.84   |
+----------------------------+--------+---------+--------------------------+--------------------------+

David
-- 
David Engel
david@istwok.net