Posted to common-user@hadoop.apache.org by Alfonso Olias Sanz <al...@gmail.com> on 2008/04/11 15:33:49 UTC

[HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

Hi
I have a general-purpose input folder that is used as input to a
Map/Reduce task. The folder contains files grouped by name.

I want to configure the JobConf so that I can filter which files are
processed in that pass (i.e. files whose names start with Elementary,
Source, etc.), so the task only processes those files. For example, if
the folder contains 1000 files and only 50 start with Elementary, only
those 50 should be processed by my task.

I could set up separate input folders, each containing one group of
files, but that is not an option for me.


Any idea?

thanks

Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

Posted by Amar Kamat <am...@yahoo-inc.com>.
One way to do this is to write your own (file) input format. See 
src/java/org/apache/hadoop/mapred/FileInputFormat.java. You need to 
override listPaths() in order to have selectivity amongst the files in 
the input folder.
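
As a rough sketch of that idea (untested; it assumes the pre-0.17
listPaths(JobConf) signature, and the "filter.prefix" property name is
made up for illustration):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class PrefixInputFormat extends TextInputFormat {
  // Let the parent class list everything in the input folder, then keep
  // only the files whose names start with the configured prefix.
  protected Path[] listPaths(JobConf job) throws IOException {
    String prefix = job.get("filter.prefix", "");
    Path[] all = super.listPaths(job);
    List<Path> kept = new ArrayList<Path>();
    for (Path p : all) {
      if (p.getName().startsWith(prefix)) {
        kept.add(p);
      }
    }
    return kept.toArray(new Path[kept.size()]);
  }
}

The job would then use it with job.setInputFormat(PrefixInputFormat.class)
and job.set("filter.prefix", "Elementary").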
Amar


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

Posted by Alfonso Olias Sanz <al...@gmail.com>.
It's addInputPath; it adds a Path object to the job's list of inputs.
So the approach would be to do the filtering first and then add the
matching paths in a loop.
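
In code, something like this (just a sketch; the directory and prefix
are illustrative, and it assumes FileSystem.listStatus is available in
the release at hand):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class InputSetup {
  // List the folder ourselves, then add each matching file as its own
  // input path instead of adding the whole directory.
  public static void addFilteredInputs(JobConf job, Path inputDir,
                                       String prefix) throws IOException {
    FileSystem fs = inputDir.getFileSystem(job);
    for (FileStatus status : fs.listStatus(inputDir)) {
      if (status.getPath().getName().startsWith(prefix)) {
        job.addInputPath(status.getPath());
      }
    }
  }
}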

But I need an InputFormat anyway, because I have my own RecordReader.
Either way I end up putting the same logic somewhere, and from my
point of view it is better to put the filtering logic in the
InputFormat, because my InputFormat is also a RecordReader factory and
will instantiate a different RecordReader based on the filter.

cheers


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

Posted by Ted Dunning <td...@veoh.com>.
You don't really need a custom input format, I don't think.

You should be able to just add multiple inputs, one at a time, after
filtering them outside Hadoop.




Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

Posted by Alfonso Olias Sanz <al...@gmail.com>.
ok thanks for the info :)


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
On Apr 11, 2008, at 10:21 AM, Amar Kamat wrote:

> A simpler way is to use FileInputFormat.setInputPathFilter(JobConf,
> PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on
> the PathFilter interface.

+1, although FileInputFormat.setInputPathFilter is available only in
hadoop-0.17 and above... as Amar mentioned previously, you'd have to
use a custom InputFormat prior to hadoop-0.17.

Arun



Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

Posted by Amar Kamat <am...@yahoo-inc.com>.
A simpler way is to use FileInputFormat.setInputPathFilter(JobConf,
PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on
the PathFilter interface.
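
For example (a sketch against the 0.17 API; the method takes the filter
class, and the "Elementary" prefix is only illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accept only files whose names start with "Elementary".
public class ElementaryFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().startsWith("Elementary");
  }
}

It would be registered on the job with
FileInputFormat.setInputPathFilter(job, ElementaryFilter.class).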
Amar


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

Posted by Ted Dunning <td...@veoh.com>.
Just call addInputFile multiple times after filtering.  (or is it
addInputPath... Don't have documentation handy)

