Posted to hdfs-user@hadoop.apache.org by Panshul Whisper <ou...@gmail.com> on 2013/01/15 15:25:56 UTC

Hadoop execution sequence

Hello,

I was wondering whether Hadoop performs the map reduce operations on the data
while maintaining the order or sequence in which it received the data.
I have a Hadoop cluster that is receiving JSON files, which are processed
and then stored on base.
For correct calculation it is essential for the JSON files to be processed
in the order they are received. How can I make sure this happens?

Thanking you,

Regards,
Ouch Whisper
01010101010

RE: Hadoop execution sequence

Posted by John Lilley <jo...@redpoint.net>.

I think it would help for Ouch to clarify what is meant by "in order". If one JSON file must be completely processed before the next file starts, there is not much point in using MapReduce at all, since your problem cannot be partitioned. On the other hand, there may be ways around this, for example:

* Use the file timestamps as keys into whatever the target data store is, so that JSON->records->insert can proceed in parallel.

* Perform a pass over the JSON files, generating "sequence numbers" based on the file times. Then run MR jobs where {Sequence,Filename} is the tuple passed to the mapper and a stream of {Sequence,FileData} is the output. The sort/shuffle by Sequence would present a time-ordered stream to the reducers. Then, if absolute order is needed, use a single reducer.

john

Caveat: I may not know what I am talking about ;-)
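
The sequence-number idea above can be sketched in a few lines. This is only a plain-Python simulation under stated assumptions, not Hadoop code: the function names (assign_sequence_numbers, map_phase, single_reducer) and the read_file callback are hypothetical, and the "shuffle" is emulated with an ordinary sort by key.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def assign_sequence_numbers(file_times: Dict[str, float]) -> List[Tuple[int, str]]:
    """Pre-pass: rank files by arrival time (ties broken by file name)."""
    ordered = sorted(file_times, key=lambda name: (file_times[name], name))
    return list(enumerate(ordered))


def map_phase(pairs: List[Tuple[int, str]],
              read_file: Callable[[str], str]) -> Iterator[Tuple[int, str]]:
    """Emit {Sequence, FileData}. Mappers run in no guaranteed order,
    which is emulated here by walking the input backwards."""
    for seq, name in reversed(pairs):
        yield seq, read_file(name)


def single_reducer(mapped: List[Tuple[int, str]]) -> List[str]:
    """Emulated sort/shuffle by Sequence, then one reducer sees every
    record, so the output is in total arrival order."""
    return [data for _, data in sorted(mapped, key=lambda kv: kv[0])]
```

With more than one reducer, each reducer would only see its own partition in sorted order, which is why the single reducer matters when absolute order is required.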

From: Mahesh Balija [mailto:balijamahesh.mca@gmail.com]
Sent: Tuesday, January 15, 2013 7:47 AM
To: user@hadoop.apache.org
Subject: Re: Hadoop execution sequence

With MapReduce, the mappers process all the input files in parallel, i.e., no order is guaranteed among the input files.
If you need to maintain the order, you have to process each file separately (in an independent MapReduce job), so that your client is responsible for submitting the individual files in order.

Best,
Mahesh Balija,
Calsoft Labs.
On Tue, Jan 15, 2013 at 7:55 PM, Panshul Whisper <ou...@gmail.com> wrote:

Hello,

I was wondering whether Hadoop performs the map reduce operations on the data while maintaining the order or sequence in which it received the data.
I have a Hadoop cluster that is receiving JSON files, which are processed and then stored on base.
For correct calculation it is essential for the JSON files to be processed in the order they are received. How can I make sure this happens?

Thanking you,

Regards,
Ouch Whisper
01010101010

Re: Hadoop execution sequence

Posted by Mahesh Balija <ba...@gmail.com>.

With MapReduce, the mappers process all the input files in parallel,
i.e., no order is guaranteed among the input files.
If you need to maintain the order, you have to process each file
separately (in an independent MapReduce job), so that your client is
responsible for submitting the individual files in order.

Best,
Mahesh Balija,
Calsoft Labs.
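
A client-side driver for the one-job-per-file approach described above might look like the following. It is a minimal sketch under stated assumptions: run_jobs_in_arrival_order and the submit_job callback are hypothetical names, and submit_job stands in for whatever submits a MapReduce job for one file and blocks until it finishes (in the Hadoop Java API, Job.waitForCompletion plays that role).

```python
from typing import Callable, Dict, List


def run_jobs_in_arrival_order(file_times: Dict[str, float],
                              submit_job: Callable[[str], bool]) -> List[str]:
    """Run one job per file, oldest file first, never overlapping jobs.

    submit_job(name) is assumed to block until the job for that file is
    done and to return True on success.
    """
    completed: List[str] = []
    # Order by arrival time; break ties by name so the order is stable.
    for name in sorted(file_times, key=lambda n: (file_times[n], n)):
        if not submit_job(name):
            # Stop here: later files must not run ahead of a failed one.
            raise RuntimeError(f"job for {name} failed; stopping to preserve order")
        completed.append(name)
    return completed
```

Each job can still parallelise work within its own file; only the files themselves are serialised, which costs throughput in exchange for strict ordering.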

On Tue, Jan 15, 2013 at 7:55 PM, Panshul Whisper <ou...@gmail.com> wrote:

> Hello,
>
> I was wondering whether Hadoop performs the map reduce operations on the data
> while maintaining the order or sequence in which it received the data.
> I have a Hadoop cluster that is receiving JSON files, which are processed
> and then stored on base.
> For correct calculation it is essential for the JSON files to be processed
> in the order they are received. How can I make sure this happens?
>
> Thanking you,
>
> Regards,
> Ouch Whisper
> 01010101010
>
