Posted to user@hadoop.apache.org by Koert Kuipers <ko...@tresata.com> on 2012/10/17 02:25:08 UTC

map-red with many input paths

currently i run a map-reduce job that reads from a single path with a glob:
"/data/*"
i am considering replacing this one glob path with an explicit list of all
the paths (so that i can check for _SUCCESS files in the subdirs and
exclude the subdirs that don't have this file, to avoid reading from
subdirs as data is being written to them).
there are hundreds of subdirectories in /data, and it will be thousands
soon... is there a limit on how many paths i can include for a map-red job?
is there a smarter way to do this?
thanks! koert

Re: map-red with many input paths

Posted by Lohit <lo...@gmail.com>.
There is no limit on the number of input paths you can have for your job. The more input paths you have, the more time is spent calculating job splits, and hence the higher the startup cost of the job.
You could write your own InputFormat that does the filtering based on your use case. Take a look at MultiFileInputFormat if you want to club multiple files per map task.
It is best to move completed job directories to some other path so as to avoid filtering altogether.
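
For illustration only, here is a minimal sketch of the filtering idea from the question: list the subdirectories of a root and keep only those that contain a _SUCCESS marker. It uses java.nio on a local filesystem as a stand-in so that it runs without a cluster; the class and method names (CompletedDirs, list, toInputPaths) are hypothetical. In a real job the same logic would presumably go through Hadoop's FileSystem.listStatus and FileSystem.exists, with the result passed to FileInputFormat.setInputPaths.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: selects subdirectories that are safe to read.
final class CompletedDirs {
    // Marker file a successfully committed job leaves in its output directory.
    static final String MARKER = "_SUCCESS";

    // Returns the direct subdirectories of root that contain the marker, sorted by path.
    static List<Path> list(Path root) throws IOException {
        try (Stream<Path> children = Files.list(root)) {
            return children
                .filter(Files::isDirectory)
                .filter(dir -> Files.exists(dir.resolve(MARKER)))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Joins the selected directories into one comma-separated string, the form
    // accepted by FileInputFormat.setInputPaths(Job, String) in Hadoop.
    static String toInputPaths(List<Path> dirs) {
        return dirs.stream().map(Path::toString).collect(Collectors.joining(","));
    }
}
```

Calling CompletedDirs.list on the data root yields only the finished subdirectories, so directories still being written are skipped. With thousands of subdirectories, the split-calculation cost mentioned above still applies to the resulting path list.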

Lohit

On Oct 16, 2012, at 5:25 PM, Koert Kuipers <ko...@tresata.com> wrote:

> currently i run a map-reduce job that reads from a single path with a glob: "/data/*"
> i am considering replacing this one glob path with an explicit list of all the paths (so that i can check for _SUCCESS files in the subdirs and exclude the subdirs that don't have this file, to avoid reading from subdirs as data is being written to them).
> there are hundreds of subdirectories in /data, and it will be thousands soon... is there a limit on how many paths i can include for a map-red job? is there a smarter way to do this?
> thanks! koert
