Posted to common-user@hadoop.apache.org by Albert Strasheim <fu...@gmail.com> on 2007/04/08 11:53:58 UTC

Performing exactly one map operation per file

Hello all

I'm a new Hadoop user and I'm looking at using Hadoop for a distributed 
machine learning application.

For my application (and probably many machine learning applications), one 
would probably want to do something like the following:

1. Upload a bunch of images/audio/whatever to the DFS
2. Run a map operation to do something like:
2.1 perform some transformation on each image, creating N new images
2.2 convert the audio into feature vectors, storing all the feature vectors 
from a single audio file in a new file
3. Store the output of these map operations in the DFS

In general, one wants to take a dataset with N discrete items, and map them 
to N other items. Each item can typically be mapped independently of the 
other items, so this distributes nicely. However, each item must be sent to 
the map operation as a unit.

I've looked through the Hadoop wiki and the code and so far I've come up 
with the following:

- HadoopStreaming will be useful, since my algorithms can be implemented as 
C++ or Python programs
- I probably want to use an IdentityReducer to achieve what I outlined above

From what I understood from running the sample programs, Hadoop splits up 
input files and passes the pieces to the map operations. However, I can't 
quite figure out how one would create a job configuration that maps a single 
file at a time instead of splitting the file (which isn't what one wants 
when dealing with images or audio).

Does anybody have some ideas on how to accomplish this? I'm guessing some 
new code might have to be written, so any pointers on where to start would 
be much appreciated.

Thanks for your time.

Regards,

Albert



Re: Performing exactly one map operation per file

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
On Sun, Apr 08, 2007 at 07:23:52PM +0200, Albert Strasheim wrote:
>Hello all
>
>It seems my RecordReader has to specify the types of the keys and
>values. From looking at the other record readers, it seems like I want
>a BytesWritable value. However, I'm not sure what to do about the key.
>One probably wants some kind of string value based on the full path to
>the input file...
>

The 'Mapper' interface which you have to implement for your mapper class also 
extends the JobConfigurable interface, which means you can provide your own 
'configure' method to which the framework passes the jobconf.xml (as a JobConf 
object). Here (@see org.apache.hadoop.mapred.SortValidator.RecordStatsChecker.Map.configure 
in src/test) you can save the actual input file, which is available as 
'map.input.file', and then use that as the 'key'. (PS: Use the 'Text' class 
instead of 'UTF8', which is deprecated.)
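
For example, something along these lines (untested, and written against the 
pre-generics mapred api, so the exact map() signature may differ in your 
hadoop version; the class name is just a placeholder):

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WholeFileMapper extends MapReduceBase implements Mapper {

      private Text inputFile;

      // The framework calls configure() with the job configuration before
      // any map() calls, so the path of the file this map task is working
      // on can be picked up here.
      public void configure(JobConf job) {
        inputFile = new Text(job.get("map.input.file"));
      }

      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter)
          throws IOException {
        BytesWritable bytes = (BytesWritable) value;
        // ... transform the bytes of the image/audio file here ...
        // Emit the full input path as the key so later stages know which
        // source file each record came from.
        output.collect(inputFile, bytes);
      }
    }
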
>Assuming that gets sorted out, the job configuration would look
>something like this:
>
>mapred.input.format.class: SingleFileInputFormat
>mapred.output.format.class: SingleFileOutputFormat
>mapred.input.key.class: UTF8 (maybe?)
>mapred.input.value.class: BytesWritable
>mapred.output.key.class: UTF8 (maybe?)
>mapred.output.value.class: BytesWritable
>
>At this point, I'm unsure about how one would convince Hadoop to make
>an output file for each input file, and how the names for the output
>files are determined.
>

Take a look at org.apache.hadoop.examples.RandomWriter.Map.map() in 
src/examples. There each map opens and writes to a file in HDFS in a specific 
directory, and no output is sent to the reducer at all, i.e. RandomWriter uses 
'NullOutputFormat'. Thus, assuming each input file is a single 
audio/video/image file and hence yields only 1 key/value pair, you can map one 
output file to one input file. This should solve your needs... the framework 
spawns one map per input file, and you get away with only 1 reducer which is a 
no-op.
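
Roughly, the map() body could then look like this (again untested; the output 
directory is just a placeholder, 'job' is the JobConf saved away in 
configure(), and the BytesWritable accessors are named getBytes()/getLength() 
in some releases):

    // Needs imports for org.apache.hadoop.fs.FileSystem, Path and
    // FSDataOutputStream; goes inside a mapper like the one sketched above.
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
        throws IOException {
      BytesWritable bytes = (BytesWritable) value;
      // ... transform the bytes here ...

      // Name the output file after the input file, in a directory of your
      // choosing ("/user/albert/croppedimgs" is just a placeholder).
      Path in = new Path(inputFile.toString());
      Path out = new Path("/user/albert/croppedimgs", in.getName());

      FileSystem fs = FileSystem.get(job);
      FSDataOutputStream os = fs.create(out);
      try {
        os.write(bytes.get(), 0, bytes.getSize());
      } finally {
        os.close();
      }
      // Nothing is collected here, so the job can use NullOutputFormat and
      // a no-op (or zero) reduce.
    }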

hth,
Arun

>From the HadoopStreaming wiki page it seems that the number of output
>files depends on the number of reduce tasks, which probably isn't what
>one wants for this application. Any thoughts on what I can do here to
>get a one-to-one mapping? For example, I'd like to do something like:
>
>bin/hadoop -mapper crop.py -input origimgs/ -output croppedimgs/
>
>so that if origimgs/ contains foo.jpg and bar.jpg, I end up with cropped
>versions of foo.jpg and bar.jpg in croppedimgs/.
>
>I hope this isn't a case of square peg, round hole. Hadoop's DFS and
>job scheduling looks perfectly suited to this kind of application, if I
>can figure out how to make Hadoop divide the "work" in a way that makes
>sense in this case.
>
>>>From what I understood from running the sample programs, Hadoop splits up
>>>input files and passes the pieces to the map operations. However, I can't
>>>quite figure out how one would create a job configuration that maps a
>>>single file at a time instead of splitting the file (which isn't what one
>>>wants when dealing with images or audio).
>>
>> The InputFormatBase defines an 'isSplitable' api which is used by
>>the framework to deduce whether the mapred framework splits up the
>>input files. You could trivially turn this off by returning 'false'
>>for your {Audio|Video|Image}InputFormat classes.
>
>Thanks, I'll try this.
>
>>>- HadoopStreaming will be useful, since my algorithms can be implemented 
>>>as
>>>C++ or Python programs
>>
>>The C++ map-reduce api that Owen has been working on might interest
>>you: http://issues.apache.org/jira/browse/HADOOP-234.
>
>I'll definitely take a closer look at this.
>
>Regards,
>
>Albert 
>

Re: Performing exactly one map operation per file

Posted by Albert Strasheim <fu...@gmail.com>.
Hello all

On Sun, 08 Apr 2007, Arun C Murthy wrote:

> Hi Albert,
> 
> On Sun, Apr 08, 2007 at 11:53:58AM +0200, Albert Strasheim wrote:
> >Hello all
> >
> >I'm a new Hadoop user and I'm looking at using Hadoop for a distributed 
> >machine learning application.
> 
> Welcome to Hadoop!
> 
> Here is a broad outline of how hadoop's map-reduce framework works specifically for user inputs/formats etc.:
> a) User specifies the input directory via JobConf.setInputPath or mapred.input.dir in the .xml file.
> b) User specifies the format of the input files so that the framework 
> can then decide how to break the data into 'records' i.e. key/value 
> pairs which are then sent to the user defined map/reduce apis. I 
> suspect you will have to come up with your own InputFormat class 
> (depending on audio/image/video files etc.) by subclassing from 
> org.apache.hadoop.mapred.InputFormatBase and also a 
> org.apache.hadoop.mapred.RecordReader (which actually reads individual 
> key/value pairs). There are some examples in org.apache.hadoop.mapred 
> package for both the above: TextInputFormat/LineRecordReader and 
> SequenceFileInputFormat/SequenceFileRecordReader; usually they come in 
> pairs.

Thanks for the pointers. I'm on my way to coding up a 
SingleFileInputFormat and a SingleFileRecordReader (for lack of better 
names at present).

It seems my RecordReader has to specify the types of the keys and 
values. From looking at the other record readers, it seems like I want 
a BytesWritable value. However, I'm not sure what to do about the key. 
One probably wants some kind of string value based on the full path to 
the input file...

Assuming that gets sorted out, the job configuration would look 
something like this:

mapred.input.format.class: SingleFileInputFormat
mapred.output.format.class: SingleFileOutputFormat
mapred.input.key.class: UTF8 (maybe?)
mapred.input.value.class: BytesWritable
mapred.output.key.class: UTF8 (maybe?)
mapred.output.value.class: BytesWritable

(Not quite sure what to do about a partitioner yet.)
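
For what it's worth, here is roughly the driver I have in mind, using the 
JobConf setters instead of raw .xml properties. SingleFileInputFormat, 
SingleFileOutputFormat and CropMapper are just my placeholder names, and I'm 
not sure the input key/value classes even need to be set explicitly if the 
RecordReader already fixes them:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class CropImagesJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CropImagesJob.class);
        conf.setJobName("crop-images");

        // Newer releases move these setters to FileInputFormat/FileOutputFormat.
        conf.setInputPath(new Path("origimgs"));
        conf.setOutputPath(new Path("croppedimgs"));

        conf.setInputFormat(SingleFileInputFormat.class);
        conf.setOutputFormat(SingleFileOutputFormat.class);

        conf.setOutputKeyClass(Text.class);            // full path of the input file
        conf.setOutputValueClass(BytesWritable.class); // raw image bytes

        conf.setMapperClass(CropMapper.class);
        conf.setReducerClass(IdentityReducer.class);   // pass map output straight through

        JobClient.runJob(conf);
      }
    }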

At this point, I'm unsure about how one would convince Hadoop to make 
an output file for each input file, and how the names for the output 
files are determined.

From the HadoopStreaming wiki page it seems that the number of output 
files depends on the number of reduce tasks, which probably isn't what 
one wants for this application. Any thoughts on what I can do here to 
get a one-to-one mapping? For example, I'd like to do something like:

bin/hadoop -mapper crop.py -input origimgs/ -output croppedimgs/

so that if origimgs/ contains foo.jpg and bar.jpg, I end up with cropped 
versions of foo.jpg and bar.jpg in croppedimgs/.
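
I'm guessing the real streaming invocation would look more like the following 
(assuming the streaming contrib jar is built; the jar path and exact options 
probably vary by release):

    bin/hadoop jar contrib/hadoop-streaming.jar \
        -input origimgs \
        -output croppedimgs \
        -mapper crop.py \
        -jobconf mapred.reduce.tasks=0

though I suspect the outputs would then come out named part-00000, part-00001 
and so on (one per map task) rather than keeping the original file names.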

I hope this isn't a case of square peg, round hole. Hadoop's DFS and 
job scheduling looks perfectly suited to this kind of application, if I 
can figure out how to make Hadoop divide the "work" in a way that makes 
sense in this case.

> >From what I understood from running the sample programs, Hadoop splits up 
> >input files and passes the pieces to the map operations. However, I can't 
> >quite figure out how one would create a job configuration that maps a 
> >single file at a time instead of splitting the file (which isn't what one 
> >wants when dealing with images or audio).
> 
>  The InputFormatBase defines an 'isSplitable' api which is used by 
> the framework to deduce whether the mapred framework splits up the 
> input files. You could trivially turn this off by returning 'false' 
> for your {Audio|Video|Image}InputFormat classes.

Thanks, I'll try this.

> >- HadoopStreaming will be useful, since my algorithms can be implemented as 
> >C++ or Python programs
> 
> The C++ map-reduce api that Owen has been working on might interest 
> you: http://issues.apache.org/jira/browse/HADOOP-234.

I'll definitely take a closer look at this.

Regards,

Albert

Re: Performing exactly one map operation per file

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
Hi Albert,

On Sun, Apr 08, 2007 at 11:53:58AM +0200, Albert Strasheim wrote:
>Hello all
>
>I'm a new Hadoop user and I'm looking at using Hadoop for a distributed 
>machine learning application.

Welcome to Hadoop!

Here is a broad outline of how hadoop's map-reduce framework works specifically for user inputs/formats etc.:
a) User specifies the input directory via JobConf.setInputPath or mapred.input.dir in the .xml file.
b) User specifies the format of the input files so that the framework can then 
decide how to break the data into 'records', i.e. key/value pairs, which are 
then sent to the user defined map/reduce apis. I suspect you will have to come 
up with your own InputFormat class (depending on audio/image/video files etc.) 
by subclassing from org.apache.hadoop.mapred.InputFormatBase and also an 
org.apache.hadoop.mapred.RecordReader (which actually reads individual 
key/value pairs). There are some examples in the org.apache.hadoop.mapred 
package for both the above: TextInputFormat/LineRecordReader and 
SequenceFileInputFormat/SequenceFileRecordReader; usually they come in pairs.
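
To make the one-record-per-file idea concrete, the record reader could look 
something like this (untested, against the pre-generics 
org.apache.hadoop.mapred.RecordReader interface; the exact methods vary a bit 
across releases, the class name is only a suggestion, and it assumes each 
file comfortably fits in memory):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;

    /** Emits exactly one record per file: key = file path, value = its bytes. */
    public class WholeFileRecordReader implements RecordReader {

      private final FileSystem fs;
      private final Path file;
      private final long length;
      private boolean done = false;

      public WholeFileRecordReader(JobConf job, FileSplit split) throws IOException {
        fs = FileSystem.get(job);
        file = split.getPath();
        length = split.getLength();
      }

      public WritableComparable createKey() { return new Text(); }
      public Writable createValue() { return new BytesWritable(); }

      public boolean next(Writable key, Writable value) throws IOException {
        if (done) return false;
        byte[] bytes = new byte[(int) length];
        FSDataInputStream in = fs.open(file);
        try {
          in.readFully(0, bytes);                 // slurp the whole file as one value
        } finally {
          in.close();
        }
        ((Text) key).set(file.toString());        // key: full path of the input file
        ((BytesWritable) value).set(bytes, 0, bytes.length);
        done = true;
        return true;
      }

      public long getPos() { return done ? length : 0; }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() { }
    }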

>
>From what I understood from running the sample programs, Hadoop splits up 
>input files and passes the pieces to the map operations. However, I can't 
>quite figure out how one would create a job configuration that maps a 
>single file at a time instead of splitting the file (which isn't what one 
>wants when dealing with images or audio).
>

 The InputFormatBase defines an 'isSplitable' api which is used by the 
framework to deduce whether the mapred framework splits up the input files. 
You could trivially turn this off by returning 'false' for your 
{Audio|Video|Image}InputFormat classes.
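
For instance (untested; the getRecordReader signature has moved around between 
releases, so match whatever InputFormatBase declares in your version, and 
WholeFileRecordReader is the reader sketched above):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputFormatBase;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    /** One split (and hence one map) per image file; files are never chopped up. */
    public class ImageInputFormat extends InputFormatBase {

      // Returning false keeps each file in a single split, so each map task
      // sees exactly one whole image.
      protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
      }

      public RecordReader getRecordReader(FileSystem fs, FileSplit split,
                                          JobConf job, Reporter reporter)
          throws IOException {
        reporter.setStatus(split.toString());
        return new WholeFileRecordReader(job, split);
      }
    }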

>- HadoopStreaming will be useful, since my algorithms can be implemented as 
>C++ or Python programs

The C++ map-reduce api that Owen has been working on might interest you: http://issues.apache.org/jira/browse/HADOOP-234.

hth,
Arun