You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Neal Richter <nr...@gmail.com> on 2009/07/23 17:54:00 UTC

Select distinct and partitions

Hello,

 Below is a Hive SQL example of what I'd say is a logic bug of the
unnecessary-extra-processing variety.

CREATE TABLE tmp_pageviews(
  user_id INT,
  count INT)
PARTITIONED BY (timeslot STRING)
ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   LINES TERMINATED BY '\012'
   LOCATION '/user/hive/warehouse/test/tmp_pageviews/';

LOAD DATA LOCAL INPATH './2009-07-21/tmp_pageviews.csv' OVERWRITE INTO
TABLE tmp_pageviews PARTITION (timeslot='2009-07-21');
LOAD DATA LOCAL INPATH './2009-07-20/tmp_pageviews.csv' OVERWRITE INTO
TABLE tmp_pageviews PARTITION (timeslot='2009-07-20');
LOAD DATA LOCAL INPATH './2009-07-19/tmp_pageviews.csv' OVERWRITE INTO
TABLE tmp_pageviews PARTITION (timeslot='2009-07-19');
LOAD DATA LOCAL INPATH './2009-07-18/tmp_pageviews.csv' OVERWRITE INTO
TABLE tmp_pageviews PARTITION (timeslot='2009-07-18');

SELECT COUNT(1) FROM tmp_pageviews WHERE timeslot='2009-07-21';

{proper results output from map-reduce job}

SELECT COUNT(1) FROM tmp_pageviews WHERE timeslot='2009-07-22';

{properly refuses to run an query as the requested partition is not in
the table's meta-data}

SELECT DISTINCT(timeslot) from tmp_pageviews;

{proper output from map-reduce job}

Why does the last query even bother to run a map-reduce job?
Everything it needs to return results are in the table's meta-data
area.  I suspect that it's kind of a special case.

Why is this important to me?  The same thing happens if I make the
above table an EXTERNAL one and individually add partitions where
LOCATIONs are S3 paths pointing to raw CSV logs from an application.
In that case the dataset is huge and I can't reasonably ask Hive to
tell me what partitions have been added via DISTINCT(timeslot).

Thanks - Neal

RE: Select distinct and partitions

Posted by Namit Jain <nj...@facebook.com>.
I agree that the last query can be optimized but currently it is not.

The workaround is to use a different API


show partitions tmp_pageviews;


Thanks,
-namit

-----Original Message-----
From: Neal Richter [mailto:nrichter@gmail.com] 
Sent: Thursday, July 23, 2009 8:54 AM
To: hive-user
Subject: Select distinct and partitions

Hello,

 Below is a Hive SQL example of what I'd say is a logic bug of the
unnecessary-extra-processing variety.

CREATE TABLE tmp_pageviews(
  user_id INT,
  count INT)
PARTITIONED BY (timeslot STRING)
ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   LINES TERMINATED BY '\012'
   LOCATION '/user/hive/warehouse/test/tmp_pageviews/';

LOAD DATA LOCAL INPATH './2009-07-21/tmp_pageviews.csv' OVERWRITE INTO
TABLE tmp_pageviews PARTITION (timeslot='2009-07-21');
LOAD DATA LOCAL INPATH './2009-07-20/tmp_pageviews.csv' OVERWRITE INTO
TABLE tmp_pageviews PARTITION (timeslot='2009-07-20');
LOAD DATA LOCAL INPATH './2009-07-19/tmp_pageviews.csv' OVERWRITE INTO
TABLE tmp_pageviews PARTITION (timeslot='2009-07-19');
LOAD DATA LOCAL INPATH './2009-07-18/tmp_pageviews.csv' OVERWRITE INTO
TABLE tmp_pageviews PARTITION (timeslot='2009-07-18');

SELECT COUNT(1) FROM tmp_pageviews WHERE timeslot='2009-07-21';

{proper results output from map-reduce job}

SELECT COUNT(1) FROM tmp_pageviews WHERE timeslot='2009-07-22';

{properly refuses to run an query as the requested partition is not in
the table's meta-data}

SELECT DISTINCT(timeslot) from tmp_pageviews;

{proper output from map-reduce job}

Why does the last query even bother to run a map-reduce job?
Everything it needs to return results are in the table's meta-data
area.  I suspect that it's kind of a special case.

Why is this important to me?  The same thing happens if I make the
above table an EXTERNAL one and individually add partitions where
LOCATIONs are S3 paths pointing to raw CSV logs from an application.
In that case the dataset is huge and I can't reasonably ask Hive to
tell me what partitions have been added via DISTINCT(timeslot).

Thanks - Neal