Posted to common-user@hadoop.apache.org by novice user <pa...@gmail.com> on 2007/07/24 06:42:34 UTC

Specifying Input condition to split file or specifying map tasks to work on assigned files individually

Hi,
 I am exploring Hadoop and using it for one of my machine learning
applications.
 I have a problem in which I need to route particular inputs to each map
task separately. For example, I have a list of <key, value> pairs sorted on
some condition in an input file. I want to split the input file on some
condition (for example, all key/value pairs which have the same key should
be given as input to a particular map task). I want to do this so that all
the necessary extra information related to that input can be loaded into
memory once in that map task, making my map procedure faster.

So, can someone please tell me whether there is a way to specify a
condition such that a particular input will be given to a particular map
task? Or, if I split the files beforehand, is there a way I can have each
map task work on one file without any further splitting? Please clarify.

Thanks in advance,
-- 


Re: Specifying Input condition to split file or specifying map tasks to work on assigned files individually

Posted by "David J. Biesack" <Da...@sas.com>.
> Date: Mon, 23 Jul 2007 21:42:34 -0700 (PDT)
> From: novice user <pa...@gmail.com>
> 
> 
> Hi,
>  I am exploring Hadoop and using it for one of my machine learning
> applications.
>  I have a problem in which I need to route particular inputs to each map
> task separately. For example, I have a list of <key, value> pairs sorted on
> some condition in an input file. I want to split the input file on some
> condition (for example, all key/value pairs which have the same key should
> be given as input to a particular map task). I want to do this so that all
> the necessary extra information related to that input can be loaded into
> memory once in that map task, making my map procedure faster.

This sounds like you could put your map processing into your reduce
operation, since Hadoop will already pass all values with the same key to
your reducer. Thus, an identity map may suffice. (I've not tried running a
job without specifying a map; maybe Hadoop works without one, in which case
you do not even need the map.) In the case where you want to partition on
some other condition, can you not simply do that by mapping your keys onto
the different enumerated values (perhaps pushing the old key into the
output value)? If I'm off the mark, sorry for misinterpreting your problem statement.
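
In case it helps, here is an untested sketch of that reduce-side approach
against the org.apache.hadoop.mapred API (PerKeyReducer and loadSideDataFor
are placeholder names, not Hadoop classes, and the exact Reducer signature
varies a bit between releases):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// All records with the same key reach a single reduce() call, so the
// per-key side information only needs to be loaded once, before the loop.
public class PerKeyReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String sideData = loadSideDataFor(key.toString());  // once per key
    while (values.hasNext()) {
      // Do the per-record ("map") work here, with sideData already in memory.
      output.collect(key, new Text(sideData + "\t" + values.next().toString()));
    }
  }

  // Placeholder: real code would load the extra information from HDFS, a
  // local side file, etc.
  private String loadSideDataFor(String key) {
    return "extra-info-for-" + key;
  }
}

// Driver fragment (also a sketch):
//   JobConf conf = new JobConf(PerKeyReducer.class);
//   conf.setInputFormat(KeyValueTextInputFormat.class);  // Text keys/values
//   conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
//   conf.setReducerClass(PerKeyReducer.class);
//   conf.setOutputKeyClass(Text.class);
//   conf.setOutputValueClass(Text.class);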

-- 
David J. Biesack     SAS Institute Inc.
(919) 531-7771       SAS Campus Drive
http://www.sas.com   Cary, NC 27513


Re: Specifying Input condition to split file or specifying map tasks to work on assigned files individually

Posted by Ted Dunning <td...@veoh.com>.

If you subclass an implementation of InputFormat such as TextInputFormat,
you can keep Hadoop from splitting your files by overriding isSplitable to
return false. You can then plug that subclass in with JobConf.setInputFormat.
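
To make that concrete, here is an untested sketch against the classic
org.apache.hadoop.mapred API (WholeFileTextInputFormat and MyJob are just
placeholder names, and the exact isSplitable signature has moved around
between releases):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// One map task per input file: never split a file, regardless of its size
// or how many HDFS blocks it spans.
public class WholeFileTextInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}

// In the job driver:
//   JobConf conf = new JobConf(MyJob.class);
//   conf.setInputFormat(WholeFileTextInputFormat.class);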

Now go to the wiki where you would have expected to find this answer and
either put the answer there or tell me where you would have looked.

Remember ASF = Apache Software Foundation.


On 7/23/07 9:42 PM, "novice user" <pa...@gmail.com> wrote:

> 
> Hi,
>  I am exploring Hadoop and using it for one of my machine learning
> applications.
>  I have a problem in which I need to route particular inputs to each map
> task separately. For example, I have a list of <key, value> pairs sorted on
> some condition in an input file. I want to split the input file on some
> condition (for example, all key/value pairs which have the same key should
> be given as input to a particular map task). I want to do this so that all
> the necessary extra information related to that input can be loaded into
> memory once in that map task, making my map procedure faster.
> 
> So, can someone please tell me whether there is a way to specify a
> condition such that a particular input will be given to a particular map
> task? Or, if I split the files beforehand, is there a way I can have each
> map task work on one file without any further splitting? Please clarify.
> 
> Thanks in advance,