Posted to user@hive.apache.org by Leo Alekseyev <dn...@gmail.com> on 2010/10/06 19:58:06 UTC

Is it possible to speed up a select ... limit query?

Suppose I have a large-ish table (over a billion rows) and want to
grab the first 5 million or so rows.  When I run the query "create
table foo_subset as select col1, col2, col3 from foo limit 5000000",
the job launches a single reducer, which runs for a while.  Looking at
HDFS_BYTES_READ/WRITTEN, I see that the reducer read the entire table
(250 GB) just to write 700 MB.  This seems wasteful.  Is there a
reason the reducer can't simply count records and terminate once it
has written 5000000 (or whatever the limit is)?  Are there
workarounds to make the process faster?
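
For concreteness, here is the query as it runs now, plus the only
workaround I can think of: sampling a bucket instead of taking the
first N rows.  This is just a sketch; tablesample only prunes the
input if foo is actually bucketed on the sampled column, which I'm
assuming below, and the "out of 200" factor is arbitrary.

    -- Current query: one reducer scans all ~250 GB to write ~700 MB.
    create table foo_subset as
    select col1, col2, col3
    from foo
    limit 5000000;

    -- Possible workaround: read one bucket instead of the whole table.
    -- This only avoids the full scan if foo was created clustered by
    -- col1 into buckets; otherwise Hive still reads everything to
    -- compute the sample.
    create table foo_subset as
    select col1, col2, col3
    from foo tablesample(bucket 1 out of 200 on col1) t
    limit 5000000;

Of course, a sample isn't the same thing as the first 5 million rows,
so I'd still like to know whether the limit itself can short-circuit
the scan.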

Thanks,

--Leo