Posted to user@hive.apache.org by Philippe Girolami <ph...@girolami.org> on 2011/03/09 00:00:50 UTC

Is it possible to run a query over multiple cores for a (small) dataset in local mode?

Hi,

I am testing Hive 0.6 on part of my data set. It's only a couple of GB of
log files that I am reading through a custom SerDe. The table is
partitioned. I am using Hadoop local mode for testing.

When I run simple Group By queries (4 MR jobs), I get logs such as:

   - map : 100%
   - reduce : 0%
   - map : 85%
   - reduce : 0%
   - map : 86%
   - reduce : 0%

all the while using only one core on an 8-core server. Kind of a waste...

I have enabled the parallel execution option, but the query still won't
parallelize. I have also set the number of reduce tasks to 8.
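
For reference, a sketch of how these two settings are typically entered in a
Hive CLI session. The property names are an assumption: the post does not say
which properties were actually set, and hive.exec.parallel and
mapred.reduce.tasks are simply the usual candidates for this Hive/Hadoop
generation.

```sql
-- Assumed property: lets independent stages of one query run concurrently
set hive.exec.parallel=true;
-- Assumed property: requests 8 reduce tasks per MapReduce job
set mapred.reduce.tasks=8;
```

Note that hive.exec.parallel governs whether independent stages of a query are
launched at the same time; it does not by itself split a single map phase
across cores.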

My expectation is that since my data set is partitioned (=> different
files), at least some of the map-reduce phases could be run in parallel on
those files.

Is my understanding wrong? Is there a specific way to write the queries?

Thanks
Philippe