Posted to common-user@hadoop.apache.org by ZhiHong Fu <dd...@gmail.com> on 2009/03/12 07:51:34 UTC
how to optimize mapreduce procedure??
Hello,
I'm writing a program that performs Lucene searches over about 12 index
directories, all of which are stored in HDFS. It works like this:
1. We build about 12 index directories with Lucene's indexing
functionality, each about 100 MB in size.
2. We store these 12 index directories on Hadoop HDFS; the cluster
consists of one namenode and five datanodes, six machines in total.
3. We then run the Lucene search over these 12 index directories.
The MapReduce steps are as follows:
Map step: the 12 index directories are split among numOfMapTasks; for
example, if numOfMapTasks=3, each map task gets 4 index directories and
records them in an intermediate result.
Combine step: for each intermediate result, we run the actual Lucene
search locally over the index directories it contains and store the
hits in the intermediate result.
Reduce step: merge the intermediate results' hits to produce the final
search result.
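The reduce step above (merging per-shard hit lists into one ranked
result) can be sketched in plain Java. This is a minimal sketch under
assumptions of ours, not code from the post: hits are assumed to carry a
doc id and a score, each per-shard list is assumed already sorted, and
the names Hit and mergeTopK are hypothetical.

```java
import java.util.*;

// Hypothetical hit type (not from the post): a document id plus its score.
record Hit(String docId, float score) {}

public class HitMerger {
    // Merge the per-shard hit lists into one global top-k, ordered by
    // descending score, using a size-bounded min-heap.
    public static List<Hit> mergeTopK(List<List<Hit>> perShardHits, int k) {
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
        for (List<Hit> shard : perShardHits) {
            for (Hit h : shard) {
                heap.offer(h);
                if (heap.size() > k) heap.poll(); // evict current lowest score
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(Hit::score).reversed());
        return result;
    }
}
```

In a real job this merge would run in the reducer, with k being the
number of hits the caller asked for.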
But with this implementation I have a performance problem. Whatever
values I set numOfMapTasks and numOfReduceTasks to, e.g.
numOfMapTasks=12 and numOfReduceTasks=5, a simple search takes about
28 seconds, which is obviously unacceptable.
So I'm not sure whether my map-reduce procedure is wrong or my numbers
of map and reduce tasks are wrong, and more generally where the
overhead in a MapReduce job comes from. Any suggestion will be
appreciated.
Thanks.
Re: how to optimize mapreduce procedure??
Posted by Ning Li <ni...@gmail.com>.
I would agree with Enis. MapReduce is good for batch-building large
indexes, but not for search, which requires real-time response.
Cheers,
Ning
On Fri, Mar 13, 2009 at 10:58 AM, Enis Soztutar <en...@gmail.com> wrote:
> ZhiHong Fu wrote:
>> [original question trimmed; quoted in full above]
>
> Keeping the indexes on HDFS is not the best choice. Moreover, MapReduce
> does not fit the problem of distributed search over several nodes. The
> overhead of starting a new job for every search is not acceptable.
> You can use Nutch distributed search or Katta (not sure about the name)
> for this.
>
Re: how to optimize mapreduce procedure??
Posted by Enis Soztutar <en...@gmail.com>.
ZhiHong Fu wrote:
> [original question trimmed; quoted in full above]
>
Keeping the indexes on HDFS is not the best choice. Moreover, MapReduce
does not fit the problem of distributed search over several nodes. The
overhead of starting a new job for every search is not acceptable.
You can use Nutch distributed search or Katta (not sure about the name)
for this.
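The alternative described here (the pattern systems like Nutch
distributed search and Katta use: long-lived searchers, one per index
shard, with a coordinator fanning the query out and merging answers)
can be illustrated with a heavily stubbed sketch. The per-shard Lucene
index is replaced by an in-memory term-to-docs map, RPC is omitted,
and all class names are hypothetical; the point is only that per-query
cost contains no job-startup overhead, because the searchers stay
resident between queries.

```java
import java.util.*;

// One long-running searcher per shard; the "index" is opened once and reused.
class ShardSearcher {
    private final Map<String, List<String>> index; // stub for a Lucene index

    ShardSearcher(Map<String, List<String>> index) { this.index = index; }

    List<String> search(String term) {
        return index.getOrDefault(term, List.of());
    }
}

public class SearchCoordinator {
    private final List<ShardSearcher> shards;

    public SearchCoordinator(List<ShardSearcher> shards) { this.shards = shards; }

    // Fan the query out to every shard and concatenate the answers.
    // No per-query setup cost: the shard searchers are already running.
    public List<String> search(String term) {
        List<String> merged = new ArrayList<>();
        for (ShardSearcher s : shards) merged.addAll(s.search(term));
        return merged;
    }
}
```

In a real deployment each ShardSearcher would wrap a Lucene
IndexSearcher on a local copy of the shard and be reached over RPC,
and the coordinator would re-rank the merged hits by score.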