Posted to mapreduce-user@hadoop.apache.org by Nishanth S <ch...@gmail.com> on 2015/07/22 21:06:56 UTC

Sorting the inputSplits

Hey folks,

Is there a way to sort the input splits in MapReduce? We have a case where
there are two files, file1 and file2, in the input directory. Since we have a
custom input format whose isSplitable always returns false, each of these
files would be processed by a different mapper. How can I make sure that
file1 is processed before file2? (I want the oldest file to be processed
first.) Is this possible?

Thanks,
Nishan

Re: Sorting the inputSplits

Posted by Rudra Tripathy <ru...@gmail.com>.
Hi Nishanth,
Even if you order the input splits, you can't order the output.
On Aug 19, 2015 1:55 AM, "Nishanth S" <ch...@gmail.com> wrote:

> Thank you. I have explained the problem in more detail below. Is this
> possible?
>
> We have a use case where files are laid out in the directory structure
> shown below. The requirement is that files inside the same Parent
> directory must not be processed in parallel: 1.txt and 2.txt cannot be
> processed in parallel because we need to do some checkpointing, so the
> oldest file has to be processed first. However, 1.txt and 5.txt can be
> processed in parallel. Right now I am overriding the listStatus method to
> pick only the oldest file, but this means I cannot achieve parallelism
> across Parent directories either, since the number of input splits is
> always 1. What would be the way to go about this use case? In short, I
> want parallelism across Parent directories but not within one. Please
> advise.
>
> published/
> +-- Parent1/
> |   +-- 1.txt
> |   +-- 2.txt
> |   +-- 3.txt
> +-- Parent2/
>     +-- 4.txt
>     +-- 5.txt

Re: Sorting the inputSplits

Posted by Nishanth S <ch...@gmail.com>.
Thank you. I have explained the problem in more detail below. Is this
possible?

We have a use case where files are laid out in the directory structure
shown below. The requirement is that files inside the same Parent directory
must not be processed in parallel: 1.txt and 2.txt cannot be processed in
parallel because we need to do some checkpointing, so the oldest file has to
be processed first. However, 1.txt and 5.txt can be processed in parallel.
Right now I am overriding the listStatus method to pick only the oldest
file, but this means I cannot achieve parallelism across Parent directories
either, since the number of input splits is always 1. What would be the way
to go about this use case? In short, I want parallelism across Parent
directories but not within one. Please advise.

published/
+-- Parent1/
|   +-- 1.txt
|   +-- 2.txt
|   +-- 3.txt
+-- Parent2/
    +-- 4.txt
    +-- 5.txt
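
The constraint above (parallelism across Parent directories, strict
oldest-first order within one) can be modelled by treating each Parent
directory as a single unit of work. The sketch below is plain Java with no
Hadoop dependency; FileInfo, groupByParent, and the modification times are
hypothetical stand-ins made up for illustration, not part of any Hadoop API.
In a real job each group might become one input split (Hadoop's
CombineFileSplit can carry several paths), with the record reader walking the
files in the listed order. That mapping is an assumption, not something
confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ParentGroupingSketch {
    // Hypothetical stand-in for a file listing entry (not a Hadoop class).
    static final class FileInfo {
        final String parent;
        final String name;
        final long modifiedAt; // assumed modification time, epoch millis

        FileInfo(String parent, String name, long modifiedAt) {
            this.parent = parent;
            this.name = name;
            this.modifiedAt = modifiedAt;
        }
    }

    // Returns one entry per parent directory; each entry lists that
    // directory's files oldest first. Entries may be handled in parallel,
    // while the files inside one entry are meant to be read sequentially.
    static Map<String, List<String>> groupByParent(List<FileInfo> files) {
        Map<String, List<FileInfo>> grouped = new TreeMap<>();
        for (FileInfo f : files) {
            grouped.computeIfAbsent(f.parent, k -> new ArrayList<>()).add(f);
        }
        Map<String, List<String>> units = new TreeMap<>();
        for (Map.Entry<String, List<FileInfo>> e : grouped.entrySet()) {
            List<FileInfo> group = e.getValue();
            // Oldest file first within the same parent directory.
            group.sort((a, b) -> Long.compare(a.modifiedAt, b.modifiedAt));
            List<String> names = new ArrayList<>();
            for (FileInfo f : group) {
                names.add(f.name);
            }
            units.put(e.getKey(), names);
        }
        return units;
    }

    public static void main(String[] args) {
        List<FileInfo> files = new ArrayList<>();
        files.add(new FileInfo("Parent1", "2.txt", 200L));
        files.add(new FileInfo("Parent2", "5.txt", 500L));
        files.add(new FileInfo("Parent1", "1.txt", 100L));
        files.add(new FileInfo("Parent2", "4.txt", 400L));
        files.add(new FileInfo("Parent1", "3.txt", 300L));
        System.out.println(groupByParent(files));
    }
}
```

With this shape the number of units equals the number of Parent directories,
so the job is no longer limited to a single split, while ordering inside a
directory is enforced by whoever reads the unit rather than by the scheduler.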




On Wed, Jul 29, 2015 at 5:31 PM, Gera Shegalov <ge...@shegalov.com> wrote:

> Can you clarify the requirement "processed first"? Maps run in parallel
> without any ordering guarantees. If you want to affect the mapping
> file->split number, you can implement your own getSplits in the custom
> input format and return the splits ordered any way you like.

Re: Sorting the inputSplits

Posted by Harsh J <ha...@cloudera.com>.
If you meant 'scheduled' first, perhaps that's doable by following (almost)
what Gera says. The framework explicitly sorts your InputSplits list by the
lengths the splits report, which would serve as the hack point for inducing
a reordering. See
https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmitter.java#L498-L503
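
The length-based sort mentioned above can be illustrated in plain Java, with
no Hadoop dependency. FakeSplit, lengthFromAge, and scheduleOrder are
hypothetical names invented for this sketch, and the largest-first direction
is an assumption about the linked JobSubmitter code, so check it against the
Hadoop version in use. The idea is that a split reporting a larger length
sorts earlier, so an older file can be given a larger reported length.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOrderSketch {
    // Hypothetical stand-in for an input split (not a Hadoop class).
    static final class FakeSplit {
        final String file;
        final long reportedLength;

        FakeSplit(String file, long reportedLength) {
            this.file = file;
            this.reportedLength = reportedLength;
        }
    }

    // Assumed hack: report a length that grows with the file's age, so that
    // older files sort first under a largest-first ordering.
    static long lengthFromAge(long modifiedAt, long now) {
        return Math.max(1L, now - modifiedAt);
    }

    // Mimics a sort by reported length, largest first (assumed direction).
    static List<String> scheduleOrder(List<FakeSplit> splits) {
        List<FakeSplit> sorted = new ArrayList<>(splits);
        sorted.sort((a, b) -> Long.compare(b.reportedLength, a.reportedLength));
        List<String> names = new ArrayList<>();
        for (FakeSplit s : sorted) {
            names.add(s.file);
        }
        return names;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        List<FakeSplit> splits = new ArrayList<>();
        splits.add(new FakeSplit("file2", lengthFromAge(900_000L, now))); // newer
        splits.add(new FakeSplit("file1", lengthFromAge(100_000L, now))); // older
        System.out.println(scheduleOrder(splits));
    }
}
```

This only influences the order in which splits are listed for scheduling. Map
tasks still run in parallel, so it does not guarantee that file1 finishes
before file2 starts, and misreporting lengths may affect anything else that
relies on split sizes.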

On Thu, Jul 30, 2015 at 10:34 PM Niels Basjes <Ni...@basjes.nl> wrote:

> MapReduce is based on the premise that several parts of a task can be
> processed independently in parallel.
> If you "require" an order of processing, then these files depend on
> each other. Why use MapReduce at all?
> With your requirement you cannot use more than one CPU anyway.
>
> Niels

Re: Sorting the inputSplits

Posted by Niels Basjes <Ni...@basjes.nl>.
MapReduce is based on the premise that several parts of a task can be
processed independently in parallel.
If you "require" an order of processing, then these files depend on
each other. Why use MapReduce at all?
With your requirement you cannot use more than one CPU anyway.

Niels

On Thu, 30 Jul 2015 01:31 Gera Shegalov <ge...@shegalov.com> wrote:

> Can you clarify the requirement "processed first"? Maps run in parallel
> without any ordering guarantees. If you want to affect the mapping
> file->split number, you can implement your own getSplits in the custom
> input format and return the splits ordered any way you like.

Re: Sorting the inputSplits

Posted by Gera Shegalov <ge...@shegalov.com>.
Can you clarify the requirement "processed first"? Maps run in parallel
without any ordering guarantees. If you want to affect the mapping
file->split number, you can implement your own getSplits in the custom
input format and return the splits ordered any way you like.
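
As a rough illustration of the file->split number mapping, the ordering step
that a custom getSplits might perform can be sketched in plain Java, with no
Hadoop dependency. InputFile and assignSplitNumbers are hypothetical names
invented for this sketch, and it assumes one unsplittable file per split.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SplitNumberingSketch {
    // Hypothetical stand-in for an input file (not a Hadoop class).
    static final class InputFile {
        final String name;
        final long modifiedAt; // assumed modification time, epoch millis

        InputFile(String name, long modifiedAt) {
            this.name = name;
            this.modifiedAt = modifiedAt;
        }
    }

    // Orders files oldest first and numbers them in that order, one split
    // per file. The returned map iterates in split-number order.
    static Map<String, Integer> assignSplitNumbers(List<InputFile> files) {
        List<InputFile> ordered = new ArrayList<>(files);
        ordered.sort((a, b) -> Long.compare(a.modifiedAt, b.modifiedAt));
        Map<String, Integer> numbers = new LinkedHashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            numbers.put(ordered.get(i).name, i);
        }
        return numbers;
    }

    public static void main(String[] args) {
        List<InputFile> files = new ArrayList<>();
        files.add(new InputFile("file2", 2_000L)); // newer
        files.add(new InputFile("file1", 1_000L)); // older
        System.out.println(assignSplitNumbers(files));
    }
}
```

The numbering only fixes which file lands in which split; it says nothing
about when each map task actually runs.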

On Wed, Jul 22, 2015 at 12:06 PM, Nishanth S <ch...@gmail.com> wrote:

> Hey folks,
>
> Is there a way to sort the input splits in MapReduce? We have a case
> where there are two files, file1 and file2, in the input directory. Since
> we have a custom input format whose isSplitable always returns false, each
> of these files would be processed by a different mapper. How can I make
> sure that file1 is processed before file2? (I want the oldest file to be
> processed first.) Is this possible?
>
> Thanks,
> Nishan
>
