Posted to common-user@hadoop.apache.org by ZhiHong Fu <dd...@gmail.com> on 2009/03/12 07:51:34 UTC

how to optimize mapreduce procedure??

Hello,

           I'm writing a program that runs Lucene searches over about
12 index directories, all of which are stored in HDFS. It works like
this:
1. We build about 12 index directories with Lucene's indexing
functionality, each about 100 MB in size.
2. We store these 12 index directories on HDFS; the Hadoop cluster
consists of one namenode and five datanodes, six machines in total.
3. We then run the Lucene search over these 12 index directories.
The MapReduce steps are as follows:
    Map procedure: the 12 index directories are split across
numOfMapTasks; for example, if numOfMapTasks=3, each map task gets 4
index directories and stores them in an intermediate result.
    Combine procedure: for each intermediate result, we run the actual
Lucene search locally over its index directories and store the hits in
that intermediate result.
    Reduce procedure: merge the intermediate results' hits to produce
the final search result.
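
The three steps can be modeled with plain functions; here is a minimal
single-process Python sketch of that split/search/merge flow (the shard
contents, function names, and scores are all illustrative, not taken
from the actual setup, where each shard is a ~100 MB Lucene index on
HDFS):

```python
import heapq

def map_split(shards, num_map_tasks):
    """'Map' step: partition the shard list across num_map_tasks tasks."""
    return [shards[i::num_map_tasks] for i in range(num_map_tasks)]

def search_shard(shard, query, top_k=10):
    """'Combine' step: run the query against one shard, return scored hits.
    A shard here is a toy dict mapping query -> {doc_id: score}."""
    hits = [(score, doc_id) for doc_id, score in shard.get(query, {}).items()]
    return heapq.nlargest(top_k, hits)

def reduce_merge(partial_results, top_k=10):
    """'Reduce' step: merge per-shard hit lists into one global top-k."""
    return heapq.nlargest(top_k, (hit for hits in partial_results for hit in hits))

# Example: 3 tiny shards, numOfMapTasks=3, so one shard per map task.
shards = [
    {"hadoop": {"doc1": 0.9, "doc2": 0.4}},
    {"hadoop": {"doc3": 0.7}},
    {"hadoop": {"doc4": 0.8, "doc5": 0.1}},
]
groups = map_split(shards, num_map_tasks=3)
partials = [search_shard(s, "hadoop") for group in groups for s in group]
print(reduce_merge(partials, top_k=3))  # -> [(0.9, 'doc1'), (0.8, 'doc4'), (0.7, 'doc3')]
```

The merge itself is cheap; as the replies below point out, the cost in
the real job comes from everything around it.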

But with this implementation I have a performance problem: whatever
values I set numOfMapTasks and numOfReduceTasks to (for example
numOfMapTasks=12, numOfReduceTasks=5), even a simple search takes
about 28 seconds, which is obviously unacceptable.
So I'm not sure whether my map-reduce procedure is wrong or whether I
chose the wrong number of map or reduce tasks, and more generally
where the overhead in a MapReduce job comes from. Any suggestions
would be appreciated.
Thanks.

Re: how to optimize mapreduce procedure??

Posted by Ning Li <ni...@gmail.com>.
I would agree with Enis. MapReduce is good for building large indexes
in batch, but not for search, which requires a real-time response.

Cheers,
Ning


On Fri, Mar 13, 2009 at 10:58 AM, Enis Soztutar <en...@gmail.com> wrote:
> ZhiHong Fu wrote:
>> [...]
>
> Keeping the indexes on HDFS is not the best choice. Moreover, MapReduce
> does not fit the problem of distributed search over several nodes: the
> overhead of starting a new job for every search is not acceptable.
> You can use Nutch distributed search or Katta (not sure about the name)
> for this.
>
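
The point both replies make, that fixed per-job overhead rather than
the search itself dominates the 28 seconds, can be made concrete with a
back-of-the-envelope model. The numbers below are purely illustrative,
not measurements from this cluster:

```python
def mapreduce_query_ms(job_startup_ms, shard_search_ms, shuffle_ms):
    # A MapReduce search pays the fixed job cost (task scheduling, JVM
    # launches, reopening the index from HDFS) again on every single query.
    return job_startup_ms + shard_search_ms + shuffle_ms

def persistent_query_ms(rpc_ms, shard_search_ms, merge_ms):
    # A long-lived search service opens each index once; per query only
    # the RPC fan-out, the actual search, and the result merge remain.
    return rpc_ms + shard_search_ms + merge_ms

# Illustrative: tens of seconds of job overhead vs milliseconds of RPC.
mr = mapreduce_query_ms(job_startup_ms=25_000, shard_search_ms=100, shuffle_ms=2_000)
ps = persistent_query_ms(rpc_ms=5, shard_search_ms=100, merge_ms=5)
print(mr, ps)  # -> 27100 110
```

Under these assumed numbers the fixed job cost is over 99% of the
MapReduce query latency, which is why tuning numOfMapTasks and
numOfReduceTasks cannot fix it.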

Re: how to optimize mapreduce procedure??

Posted by Enis Soztutar <en...@gmail.com>.
ZhiHong Fu wrote:
> [...]
Keeping the indexes on HDFS is not the best choice. Moreover, MapReduce
does not fit the problem of distributed search over several nodes: the
overhead of starting a new job for every search is not acceptable.
You can use Nutch distributed search or Katta (not sure about the name)
for this.
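
The pattern behind Nutch distributed search and Katta is a long-lived
searcher per shard, with each query fanned out to all shards per
request and only the top hits merged. A minimal in-process Python
sketch of that scatter-gather pattern (class name and shard contents
are illustrative; in the real systems each searcher is a separate
server holding a Lucene IndexSearcher open):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

class ShardSearcher:
    """Long-lived searcher holding one (toy) shard open across queries."""
    def __init__(self, shard):
        self.shard = shard  # opened once, reused for every query

    def search(self, query, top_k):
        hits = [(score, doc) for doc, score in self.shard.get(query, {}).items()]
        return heapq.nlargest(top_k, hits)

def scatter_gather(searchers, query, top_k=10):
    # Fan the query out to every shard in parallel, then merge the top-k.
    # (A real service would keep the pool alive across queries too.)
    with ThreadPoolExecutor(max_workers=len(searchers)) as pool:
        partials = pool.map(lambda s: s.search(query, top_k), searchers)
        return heapq.nlargest(top_k, (h for p in partials for h in p))

searchers = [
    ShardSearcher({"hadoop": {"doc1": 0.9}}),
    ShardSearcher({"hadoop": {"doc2": 0.6, "doc3": 0.3}}),
]
print(scatter_gather(searchers, "hadoop", top_k=2))  # -> [(0.9, 'doc1'), (0.6, 'doc2')]
```

Because nothing is scheduled or reopened per query, latency is bounded
by the slowest shard search plus the merge, not by job start-up.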