Posted to user@hadoop.apache.org by Adrian CAPDEFIER <ch...@gmail.com> on 2013/09/12 15:36:51 UTC

chaining (the output of) jobs/ reducers

Howdy,

My application requires 2 distinct processing steps (reducers) to be
performed on the input data. The first operation changes the key values,
and records that had different keys in step 1 can end up having the same
key in step 2.

The heavy lifting happens in step 1; step 2 only combines records whose
keys were changed.

In short the overview is:
Sequential file -> Step 1 -> Step 2 -> Output.


To implement this in hadoop, it seems that I need to create a separate job
for each step.

Now, I assumed there would be some sort of job management under hadoop to
link Jobs 1 and 2, but the only thing I could find was related to job
scheduling and nothing on how to synchronize the input/output of the
linked jobs.



The only crude solution that I can think of is to use a temporary file
under HDFS, but even so I'm not sure if this will work.

The overview of the process would be:
Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
(key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
(key2, value 3)] => output.

Is there a better way to pass the output from Job A as input to Job B
(e.g. using network streams or some built-in Java classes that don't do
disk I/O)?



The temporary file solution will work in a single node configuration, but
I'm not sure about an MPP config.

Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
both jobs run on all 4 nodes - will HDFS be able to automagically
redistribute the records between nodes, or does this need to be coded
somehow?

Re: chaining (the output of) jobs/ reducers

Posted by Venkata K Pisupat <kr...@gmail.com>.
Cascading would be a good option in case you have a complex flow. However, in your case, you are only trying to chain two jobs. I would suggest following these steps:

1. Set the output directory of Job1 as the input directory for Job2.
2. Launch Job1 using the new API. In the launcher program, instead of using JobConf and JobClient to run the job, use the Job class. To run the job, invoke Job.waitForCompletion(true) on Job1. This blocks the program until Job1 has run completely.
3. Optionally, you can combine the individual output files generated by each reducer (if you have more than 1 reducer task) into one or more files.
4. The next step is to launch Job2 (a minimal driver sketch follows below).

The output of Job1 is written to HDFS, so you will not have any issues when Job2 reads its input (Job1's output).
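
For illustration, a minimal driver sketch along these lines. The mapper/reducer class names, the Text key/value types and the three command-line paths are placeholders invented for the example, not anything from this thread; the new Job(conf, ...) constructor is used because it exists on both the 1.x and 2.x lines (Job.getInstance is the newer alternative).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // Job1's output = Job2's input
        Path output = new Path(args[2]);

        // Job1: the heavy lifting, re-keying the records (step 1).
        Job job1 = new Job(conf, "step 1");
        job1.setJarByClass(ChainDriver.class);
        job1.setMapperClass(StepOneMapper.class);      // placeholder classes
        job1.setReducerClass(StepOneReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);

        // Block until Job1 has finished before submitting Job2.
        if (!job1.waitForCompletion(true)) {
          System.exit(1);
        }

        // Job2: combines the records that now share a key (step 2).
        Job job2 = new Job(conf, "step 2");
        job2.setJarByClass(ChainDriver.class);
        job2.setMapperClass(StepTwoMapper.class);      // placeholder classes
        job2.setReducerClass(StepTwoReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);

        System.exit(job2.waitForCompletion(true) ? 0 : 1);
      }
    }

The intermediate directory plays the role of the "temporary file" from the original question; because it lives in HDFS, Job2's map tasks can still be scheduled close to the data, as the other replies in this thread explain.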





On Sep 12, 2013, at 12:02 PM, Adrian CAPDEFIER <ch...@gmail.com> wrote:

> Thanks Bryan.
> 
> Yes, I am using hadoop + hdfs.
> 
> If I understand your point, hadoop tries to start the mapping processes on nodes where the data is local and if that's not possible, then it is hdfs that replicates the data to the mapper nodes? 
> 
> I expected to have to set up this in the code and I completely ignored HDFS; I guess it's a case of not seeing the forest from all the trees!
> 
> 
> On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <bb...@hubspot.com> wrote:
> It really comes down to the following:
> 
> In Job A set mapred.output.dir to some directory X.
> In Job B set mapred.input.dir to the same directory X.
> 
> For Job A, do context.write() as normally, and each reducer will create an output file in mapred.output.dir.  Then in Job B each of those will correspond to a mapper.
> 
> Of course you need to make sure your input and output formats, as well as input and output keys/values, match up between the two jobs as well.
> 
> If you are using HDFS, which it seems you are, the directories specified can be HDFS directories.  In that case, with a replication factor of 3, each of these output files will exist on 3 nodes.  Hadoop and HDFS will do the work to ensure that the mappers in the second job do as good a job as possible to be data or rack-local.
> 
> 
> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <ch...@gmail.com> wrote:
> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd prefer to keep, if possible, everything as close to the hadoop libraries.
> 
> I am sure I am overlooking something basic as repartitioning is a fairly common operation in MPP environments.
> 
> 
> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com> wrote:
> If you want to stay in Java look at Cascading. Pig is also helpful. I think there are other (Spring integration maybe?) but I'm not familiar with them enough to make a recommendation.
> 
> Note that with Cascading and Pig you don't write 'map reduce' you write logic and they map it to the various mapper/reducer steps automatically.
> 
> Hope this helps,
> 
> Chris
> 
> 
> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <ch...@gmail.com> wrote:
> Howdy,
> 
> My application requires 2 distinct processing steps (reducers) to be performed on the input data. The first operation changes the key values, and records that had different keys in step 1 can end up having the same key in step 2.
> 
> The heavy lifting happens in step 1; step 2 only combines records whose keys were changed.
> 
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
> 
> 
> To implement this in hadoop, it seems that I need to create a separate job for each step. 
> 
> Now, I assumed there would be some sort of job management under hadoop to link Jobs 1 and 2, but the only thing I could find was related to job scheduling and nothing on how to synchronize the input/output of the linked jobs.
> 
> 
> 
> The only crude solution that I can think of is to use a temporary file under HDFS, but even so I'm not sure if this will work.
> 
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer (key2, value 3)] => output.
> 
> Is there a better way to pass the output from Job A as input to Job B (e.g. using network streams or some built in java classes that don't do disk i/o)? 
> 
> 
> 
> The temporary file solution will work in a single node configuration, but I'm not sure about an MPP config.
> 
> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or both jobs run on all 4 nodes - will HDFS be able to redistribute automagically the records between nodes or does this need to be coded somehow? 
> 
> 
> 
> 


Re: chaining (the output of) jobs/ reducers

Posted by Adrian CAPDEFIER <ch...@gmail.com>.
Thanks Bryan. This is great stuff!


On Thu, Sep 12, 2013 at 8:49 PM, Bryan Beaudreault <bbeaudreault@hubspot.com
> wrote:

> Hey Adrian,
>
> To clarify, the replication happens on *write*.  So as you write output
> from the reducer of Job A, you are writing into hdfs.  Part of that write
> path is replicating the data to 2 additional hosts in the cluster (local +
> 2, this is configured by dfs.replication configuration value).  So by the
> time Job B starts, hadoop has 3 options where each mapper can run and be
> data-local.  Hadoop will do all the work to try to make everything as local
> as possible.
>
> You'll be able to see from the counters on the job how successful hadoop
> was at placing your mappers.  See the counters "Data-local map tasks" and
> "Rack-local map tasks".  Rack-local being those where hadoop was not able
> to place the mapper on the same host as the data, but was at least able to
> keep it within the same rack.
>
> All of this is dependent on a proper topology configuration, both in your
> NameNode and JobTracker.
>
>
> On Thu, Sep 12, 2013 at 3:02 PM, Adrian CAPDEFIER <ch...@gmail.com>wrote:
>
>> Thanks Bryan.
>>
>> Yes, I am using hadoop + hdfs.
>>
>> If I understand your point, hadoop tries to start the mapping processes
>> on nodes where the data is local and if that's not possible, then it is
>> hdfs that replicates the data to the mapper nodes?
>>
>> I expected to have to set up this in the code and I completely ignored
>> HDFS; I guess it's a case of not seeing the forest from all the trees!
>>
>>
>>
>>  On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <
>> bbeaudreault@hubspot.com> wrote:
>>
>>> It really comes down to the following:
>>>
>>> In Job A set mapred.output.dir to some directory X.
>>> In Job B set mapred.input.dir to the same directory X.
>>>
>>> For Job A, do context.write() as normally, and each reducer will create
>>> an output file in mapred.output.dir.  Then in Job B each of those will
>>> correspond to a mapper.
>>>
>>> Of course you need to make sure your input and output formats, as well
>>> as input and output keys/values, match up between the two jobs as well.
>>>
>>> If you are using HDFS, which it seems you are, the directories specified
>>> can be HDFS directories.  In that case, with a replication factor of 3,
>>> each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
>>> the work to ensure that the mappers in the second job do as good a job as
>>> possible to be data or rack-local.
>>>
>>>
>>> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <
>>> chivas314159@gmail.com> wrote:
>>>
>>>> Thank you, Chris. I will look at Cascading and Pig, but for starters
>>>> I'd prefer to keep, if possible, everything as close to the hadoop
>>>> libraries.
>>>>
>>>> I am sure I am overlooking something basic as repartitioning is a
>>>> fairly common operation in MPP environments.
>>>>
>>>>
>>>> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com>wrote:
>>>>
>>>>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>>>>> think there are other (Spring integration maybe?) but I'm not familiar with
>>>>> them enough to make a recommendation.
>>>>>
>>>>> Note that with Cascading and Pig you don't write 'map reduce' you
>>>>> write logic and they map it to the various mapper/reducer steps
>>>>> automatically.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Chris
>>>>>
>>>>>
>>>>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <
>>>>> chivas314159@gmail.com> wrote:
>>>>>
>>>>>> Howdy,
>>>>>>
>>>>>> My application requires 2 distinct processing steps (reducers) to be
>>>>>> performed on the input data. The first operation changes the key values,
>>>>>> and records that had different keys in step 1 can end up having the
>>>>>> same key in step 2.
>>>>>>
>>>>>> The heavy lifting happens in step 1; step 2 only combines records
>>>>>> whose keys were changed.
>>>>>>
>>>>>> In short the overview is:
>>>>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>>>>
>>>>>>
>>>>>> To implement this in hadoop, it seems that I need to create a
>>>>>> separate job for each step.
>>>>>>
>>>>>> Now, I assumed there would be some sort of job management under hadoop
>>>>>> to link Jobs 1 and 2, but the only thing I could find was related to job
>>>>>> scheduling and nothing on how to synchronize the input/output of the
>>>>>> linked jobs.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The only crude solution that I can think of is to use a temporary
>>>>>> file under HDFS, but even so I'm not sure if this will work.
>>>>>>
>>>>>> The overview of the process would be:
>>>>>> Sequential Input (lines) => Job A[Mapper (key1, value1) =>
>>>>>> ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2,
>>>>>> value2) => Reducer (key2, value 3)] => output.
>>>>>>
>>>>>> Is there a better way to pass the output from Job A as input to Job B
>>>>>> (e.g. using network streams or some built in java classes that don't do
>>>>>> disk i/o)?
>>>>>>
>>>>>>
>>>>>>
>>>>>> The temporary file solution will work in a single node configuration,
>>>>>> but I'm not sure about an MPP config.
>>>>>>
>>>>>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3
>>>>>> or both jobs run on all 4 nodes - will HDFS be able to redistribute
>>>>>> automagically the records between nodes or does this need to be coded
>>>>>> somehow?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: chaining (the output of) jobs/ reducers

Posted by Bryan Beaudreault <bb...@hubspot.com>.
Hey Adrian,

To clarify, the replication happens on *write*.  So as you write output
from the reducer of Job A, you are writing into hdfs.  Part of that write
path is replicating the data to 2 additional hosts in the cluster (local +
2; this is configured by the dfs.replication configuration value).  So by
the time Job B starts, hadoop has 3 options for where each mapper can run
and be data-local.  Hadoop will do all the work to try to make everything
as local as possible.
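
For reference, a minimal sketch of that setting as it usually appears in hdfs-site.xml (3 is the stock default; the value here is illustrative, not something confirmed in this thread):

    <property>
      <name>dfs.replication</name>
      <value>3</value>
      <description>Number of replicas written for each new HDFS block.</description>
    </property>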

You'll be able to see from the counters on the job how successful hadoop
was at placing your mappers.  See the counters "Data-local map tasks" and
"Rack-local map tasks".  Rack-local being those where hadoop was not able
to place the mapper on the same host as the data, but was at least able to
keep it within the same rack.
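
If you would rather check this from the driver than from the web UI, a hedged sketch, assuming the handle for the second job (call it jobB) is still in scope and that the org.apache.hadoop.mapreduce.JobCounter enum is available on your client (older 1.x clients expose the same counters under different class names, so treat the exact identifiers as an assumption):

    // After jobB.waitForCompletion(true) has returned:
    org.apache.hadoop.mapreduce.Counters counters = jobB.getCounters();
    long dataLocal = counters.findCounter(
        org.apache.hadoop.mapreduce.JobCounter.DATA_LOCAL_MAPS).getValue();
    long rackLocal = counters.findCounter(
        org.apache.hadoop.mapreduce.JobCounter.RACK_LOCAL_MAPS).getValue();
    System.out.println("data-local maps: " + dataLocal
        + ", rack-local maps: " + rackLocal);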

All of this is dependent on a proper topology configuration, both in your
NameNode and JobTracker.
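
As a rough sketch of what that looks like, the cluster is pointed at a rack-mapping script through a property in core-site.xml on the daemons. The script path below is a placeholder, and the property name varies by version (topology.script.file.name on the 1.x line, net.topology.script.file.name on 2.x), so check your release:

    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/conf/rack-topology.sh</value>
      <description>Script that maps a host or IP to its rack, e.g. /rack1.</description>
    </property>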


On Thu, Sep 12, 2013 at 3:02 PM, Adrian CAPDEFIER <ch...@gmail.com>wrote:

> Thanks Bryan.
>
> Yes, I am using hadoop + hdfs.
>
> If I understand your point, hadoop tries to start the mapping processes on
> nodes where the data is local and if that's not possible, then it is hdfs
> that replicates the data to the mapper nodes?
>
> I expected to have to set up this in the code and I completely ignored
> HDFS; I guess it's a case of not seeing the forest from all the trees!
>
>
>
>  On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <
> bbeaudreault@hubspot.com> wrote:
>
>> It really comes down to the following:
>>
>> In Job A set mapred.output.dir to some directory X.
>> In Job B set mapred.input.dir to the same directory X.
>>
>> For Job A, do context.write() as normally, and each reducer will create
>> an output file in mapred.output.dir.  Then in Job B each of those will
>> correspond to a mapper.
>>
>> Of course you need to make sure your input and output formats, as well as
>> input and output keys/values, match up between the two jobs as well.
>>
>> If you are using HDFS, which it seems you are, the directories specified
>> can be HDFS directories.  In that case, with a replication factor of 3,
>> each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
>> the work to ensure that the mappers in the second job do as good a job as
>> possible to be data or rack-local.
>>
>>
>> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <
>> chivas314159@gmail.com> wrote:
>>
>>> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
>>> prefer to keep, if possible, everything as close to the hadoop libraries.
>>>
>>> I am sure I am overlooking something basic as repartitioning is a fairly
>>> common operation in MPP environments.
>>>
>>>
>>> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com>wrote:
>>>
>>>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>>>> think there are other (Spring integration maybe?) but I'm not familiar with
>>>> them enough to make a recommendation.
>>>>
>>>> Note that with Cascading and Pig you don't write 'map reduce' you write
>>>> logic and they map it to the various mapper/reducer steps automatically.
>>>>
>>>> Hope this helps,
>>>>
>>>> Chris
>>>>
>>>>
>>>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <
>>>> chivas314159@gmail.com> wrote:
>>>>
>>>>> Howdy,
>>>>>
>>>>> My application requires 2 distinct processing steps (reducers) to be
>>>>> performed on the input data. The first operation changes the key values,
>>>>> and records that had different keys in step 1 can end up having the
>>>>> same key in step 2.
>>>>>
>>>>> The heavy lifting happens in step 1; step 2 only combines records
>>>>> whose keys were changed.
>>>>>
>>>>> In short the overview is:
>>>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>>>
>>>>>
>>>>> To implement this in hadoop, it seems that I need to create a separate
>>>>> job for each step.
>>>>>
>>>>> Now, I assumed there would be some sort of job management under hadoop
>>>>> to link Jobs 1 and 2, but the only thing I could find was related to job
>>>>> scheduling and nothing on how to synchronize the input/output of the
>>>>> linked jobs.
>>>>>
>>>>>
>>>>>
>>>>> The only crude solution that I can think of is to use a temporary file
>>>>> under HDFS, but even so I'm not sure if this will work.
>>>>>
>>>>> The overview of the process would be:
>>>>> Sequential Input (lines) => Job A[Mapper (key1, value1) =>
>>>>> ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2,
>>>>> value2) => Reducer (key2, value 3)] => output.
>>>>>
>>>>> Is there a better way to pass the output from Job A as input to Job B
>>>>> (e.g. using network streams or some built in java classes that don't do
>>>>> disk i/o)?
>>>>>
>>>>>
>>>>>
>>>>> The temporary file solution will work in a single node configuration,
>>>>> but I'm not sure about an MPP config.
>>>>>
>>>>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3
>>>>> or both jobs run on all 4 nodes - will HDFS be able to redistribute
>>>>> automagically the records between nodes or does this need to be coded
>>>>> somehow?
>>>>>
>>>>
>>>>
>>>
>>
>

Re: chaining (the output of) jobs/ reducers

Posted by Venkata K Pisupat <kr...@gmail.com>.
Cascading would a good option in case you have a complex flow. However, in your case, you are trying to chain two jobs only. I would suggest you to follow these steps. 

1. The output directory of Job1 would be set at the input directory for Job2. 
2. Launch Job1 using the new API. In launcher program, instead of using Jobconf and JobClient for running job, use Job class. To run the job, invoke Job.waitForcompletion(true) on Job1. This ensures to block the program until Job1 is run completely. 
3. Optionally, you can combine the individual output files generated by each reducer (if you have more than 1 reducer task) into one or more files. 
4. Next step would be to launch Job2. 

The output of Job1 is written to HDFS and therefore, you will not have any issues while Job2 reads the input (Job1's output). 





On Sep 12, 2013, at 12:02 PM, Adrian CAPDEFIER <ch...@gmail.com> wrote:

> Thanks Bryan.
> 
> Yes, I am using hadoop + hdfs.
> 
> If I understand your point, hadoop tries to start the mapping processes on nodes where the data is local and if that's not possible, then it is hdfs that replicates the data to the mapper nodes? 
> 
> I expected to have to set up this in the code and I completely ignored HDFS; I guess it's a case of not seeing the forest from all the trees!
> 
> 
> On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <bb...@hubspot.com> wrote:
> It really comes down to the following:
> 
> In Job A set mapred.output.dir to some directory X.
> In Job B set mapred.input.dir to the same directory X.
> 
> For Job A, do context.write() as normally, and each reducer will create an output file in mapred.output.dir.  Then in Job B each of those will correspond to a mapper.
> 
> Of course you need to make sure your input and output formats, as well as input and output keys/values, match up between the two jobs as well.
> 
> If you are using HDFS, which it seems you are, the directories specified can be HDFS directories.  In that case, with a replication factor of 3, each of these output files will exist on 3 nodes.  Hadoop and HDFS will do the work to ensure that the mappers in the second job do as good a job as possible to be data or rack-local.
> 
> 
> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <ch...@gmail.com> wrote:
> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd prefer to keep, if possible, everything as close to the hadoop libraries.
> 
> I am sure I am overlooking something basic as repartitioning is a fairly common operation in MPP environments.
> 
> 
> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com> wrote:
> If you want to stay in Java look at Cascading. Pig is also helpful. I think there are other (Spring integration maybe?) but I'm not familiar with them enough to make a recommendation.
> 
> Note that with Cascading and Pig you don't write 'map reduce' you write logic and they map it to the various mapper/reducer steps automatically.
> 
> Hope this helps,
> 
> Chris
> 
> 
> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <ch...@gmail.com> wrote:
> Howdy,
> 
> My application requires 2 distinct processing steps (reducers) to be performed on the input data. The first operation generates changes the key values and, records that had different keys in step 1 can end up having the same key in step 2.
> 
> The heavy lifting of the operation is in step1 and step2 only combines records where keys were changed.
> 
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
> 
> 
> To implement this in hadoop, it seems that I need to create a separate job for each step. 
> 
> Now I assumed, there would some sort of job management under hadoop to link Job 1 and 2, but the only thing I could find was related to job scheduling and nothing on how to synchronize the input/output of the linked jobs.
> 
> 
> 
> The only crude solution that I can think of is to use a temporary file under HDFS, but even so I'm not sure if this will work.
> 
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer (key2, value 3)] => output.
> 
> Is there a better way to pass the output from Job A as input to Job B (e.g. using network streams or some built in java classes that don't do disk i/o)? 
> 
> 
> 
> The temporary file solution will work in a single node configuration, but I'm not sure about an MPP config.
> 
> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or both jobs run on all 4 nodes - will HDFS be able to redistribute automagically the records between nodes or does this need to be coded somehow? 
> 
> 
> 
> 


Re: chaining (the output of) jobs/ reducers

Posted by Bryan Beaudreault <bb...@hubspot.com>.
Hey Adrian,

To clarify, the replication happens on *write*.  So as you write output
from the reducer of Job A, you are writing into hdfs.  Part of that write
path is replicating the data to 2 additional hosts in the cluster (local +
2, this is configured by dfs.replication configuration value).  So by the
time Job B starts, hadoop has 3 options where each mapper can run and be
data-local.  Hadoop will do all the work to try to make everything as local
as possible.

You'll be able to see from the counters on the job how successful hadoop
was at placing your mappers.  See the counters "Data-local map tasks" and
"Rack-local map tasks".  Rack-local being those where hadoop was not able
to place the mapper on the same host as the data, but was at least able to
keep it within the same rack.

All of this is dependent a proper topology configuration, both in your
NameNode and JobTracker.


On Thu, Sep 12, 2013 at 3:02 PM, Adrian CAPDEFIER <ch...@gmail.com>wrote:

> Thanks Bryan.
>
> Yes, I am using hadoop + hdfs.
>
> If I understand your point, hadoop tries to start the mapping processes on
> nodes where the data is local and if that's not possible, then it is hdfs
> that replicates the data to the mapper nodes?
>
> I expected to have to set up this in the code and I completely ignored
> HDFS; I guess it's a case of not seeing the forest from all the trees!
>
>
>
>  On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <
> bbeaudreault@hubspot.com> wrote:
>
>> It really comes down to the following:
>>
>> In Job A set mapred.output.dir to some directory X.
>> In Job B set mapred.input.dir to the same directory X.
>>
>> For Job A, do context.write() as normally, and each reducer will create
>> an output file in mapred.output.dir.  Then in Job B each of those will
>> correspond to a mapper.
>>
>> Of course you need to make sure your input and output formats, as well as
>> input and output keys/values, match up between the two jobs as well.
>>
>> If you are using HDFS, which it seems you are, the directories specified
>> can be HDFS directories.  In that case, with a replication factor of 3,
>> each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
>> the work to ensure that the mappers in the second job do as good a job as
>> possible to be data or rack-local.
>>
>>
>> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <
>> chivas314159@gmail.com> wrote:
>>
>>> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
>>> prefer to keep, if possible, everything as close to the hadoop libraries.
>>>
>>> I am sure I am overlooking something basic as repartitioning is a fairly
>>> common operation in MPP environments.
>>>
>>>
>>> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com>wrote:
>>>
>>>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>>>> think there are other (Spring integration maybe?) but I'm not familiar with
>>>> them enough to make a recommendation.
>>>>
>>>> Note that with Cascading and Pig you don't write 'map reduce' you write
>>>> logic and they map it to the various mapper/reducer steps automatically.
>>>>
>>>> Hope this helps,
>>>>
>>>> Chris
>>>>
>>>>
>>>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <
>>>> chivas314159@gmail.com> wrote:
>>>>
>>>>> Howdy,
>>>>>
>>>>> My application requires 2 distinct processing steps (reducers) to be
>>>>> performed on the input data. The first operation generates changes the key
>>>>> values and, records that had different keys in step 1 can end up having the
>>>>> same key in step 2.
>>>>>
>>>>> The heavy lifting of the operation is in step1 and step2 only combines
>>>>> records where keys were changed.
>>>>>
>>>>> In short the overview is:
>>>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>>>
>>>>>
>>>>> To implement this in hadoop, it seems that I need to create a separate
>>>>> job for each step.
>>>>>
>>>>> Now I assumed, there would some sort of job management under hadoop to
>>>>> link Job 1 and 2, but the only thing I could find was related to job
>>>>> scheduling and nothing on how to synchronize the input/output of the linked
>>>>> jobs.
>>>>>
>>>>>
>>>>>
>>>>> The only crude solution that I can think of is to use a temporary file
>>>>> under HDFS, but even so I'm not sure if this will work.
>>>>>
>>>>> The overview of the process would be:
>>>>> Sequential Input (lines) => Job A[Mapper (key1, value1) =>
>>>>> ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2,
>>>>> value2) => Reducer (key2, value 3)] => output.
>>>>>
>>>>> Is there a better way to pass the output from Job A as input to Job B
>>>>> (e.g. using network streams or some built in java classes that don't do
>>>>> disk i/o)?
>>>>>
>>>>>
>>>>>
>>>>> The temporary file solution will work in a single node configuration,
>>>>> but I'm not sure about an MPP config.
>>>>>
>>>>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3
>>>>> or both jobs run on all 4 nodes - will HDFS be able to redistribute
>>>>> automagically the records between nodes or does this need to be coded
>>>>> somehow?
>>>>>
>>>>
>>>>
>>>
>>
>

Re: chaining (the output of) jobs/ reducers

Posted by Bryan Beaudreault <bb...@hubspot.com>.
Hey Adrian,

To clarify, the replication happens on *write*.  So as you write output
from the reducer of Job A, you are writing into hdfs.  Part of that write
path is replicating the data to 2 additional hosts in the cluster (local +
2, this is configured by dfs.replication configuration value).  So by the
time Job B starts, hadoop has 3 options where each mapper can run and be
data-local.  Hadoop will do all the work to try to make everything as local
as possible.

You'll be able to see from the counters on the job how successful hadoop
was at placing your mappers.  See the counters "Data-local map tasks" and
"Rack-local map tasks".  Rack-local being those where hadoop was not able
to place the mapper on the same host as the data, but was at least able to
keep it within the same rack.

All of this is dependent a proper topology configuration, both in your
NameNode and JobTracker.


On Thu, Sep 12, 2013 at 3:02 PM, Adrian CAPDEFIER <ch...@gmail.com>wrote:

> Thanks Bryan.
>
> Yes, I am using hadoop + hdfs.
>
> If I understand your point, hadoop tries to start the mapping processes on
> nodes where the data is local and if that's not possible, then it is hdfs
> that replicates the data to the mapper nodes?
>
> I expected to have to set up this in the code and I completely ignored
> HDFS; I guess it's a case of not seeing the forest from all the trees!
>
>
>
>  On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <
> bbeaudreault@hubspot.com> wrote:
>
>> It really comes down to the following:
>>
>> In Job A set mapred.output.dir to some directory X.
>> In Job B set mapred.input.dir to the same directory X.
>>
>> For Job A, do context.write() as normally, and each reducer will create
>> an output file in mapred.output.dir.  Then in Job B each of those will
>> correspond to a mapper.
>>
>> Of course you need to make sure your input and output formats, as well as
>> input and output keys/values, match up between the two jobs as well.
>>
>> If you are using HDFS, which it seems you are, the directories specified
>> can be HDFS directories.  In that case, with a replication factor of 3,
>> each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
>> the work to ensure that the mappers in the second job do as good a job as
>> possible to be data or rack-local.
>>
>>
>> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <
>> chivas314159@gmail.com> wrote:
>>
>>> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
>>> prefer to keep, if possible, everything as close to the hadoop libraries.
>>>
>>> I am sure I am overlooking something basic as repartitioning is a fairly
>>> common operation in MPP environments.
>>>
>>>
>>> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com>wrote:
>>>
>>>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>>>> think there are other (Spring integration maybe?) but I'm not familiar with
>>>> them enough to make a recommendation.
>>>>
>>>> Note that with Cascading and Pig you don't write 'map reduce' you write
>>>> logic and they map it to the various mapper/reducer steps automatically.
>>>>
>>>> Hope this helps,
>>>>
>>>> Chris
>>>>
>>>>
>>>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <
>>>> chivas314159@gmail.com> wrote:
>>>>
>>>>> Howdy,
>>>>>
>>>>> My application requires 2 distinct processing steps (reducers) to be
>>>>> performed on the input data. The first operation generates changes the key
>>>>> values and, records that had different keys in step 1 can end up having the
>>>>> same key in step 2.
>>>>>
>>>>> The heavy lifting of the operation is in step1 and step2 only combines
>>>>> records where keys were changed.
>>>>>
>>>>> In short the overview is:
>>>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>>>
>>>>>
>>>>> To implement this in hadoop, it seems that I need to create a separate
>>>>> job for each step.
>>>>>
>>>>> Now I assumed, there would some sort of job management under hadoop to
>>>>> link Job 1 and 2, but the only thing I could find was related to job
>>>>> scheduling and nothing on how to synchronize the input/output of the linked
>>>>> jobs.
>>>>>
>>>>>
>>>>>
>>>>> The only crude solution that I can think of is to use a temporary file
>>>>> under HDFS, but even so I'm not sure if this will work.
>>>>>
>>>>> The overview of the process would be:
>>>>> Sequential Input (lines) => Job A[Mapper (key1, value1) =>
>>>>> ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2,
>>>>> value2) => Reducer (key2, value 3)] => output.
>>>>>
>>>>> Is there a better way to pass the output from Job A as input to Job B
>>>>> (e.g. using network streams or some built in java classes that don't do
>>>>> disk i/o)?
>>>>>
>>>>>
>>>>>
>>>>> The temporary file solution will work in a single node configuration,
>>>>> but I'm not sure about an MPP config.
>>>>>
>>>>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3
>>>>> or both jobs run on all 4 nodes - will HDFS be able to redistribute
>>>>> automagically the records between nodes or does this need to be coded
>>>>> somehow?
>>>>>
>>>>
>>>>
>>>
>>
>

Re: chaining (the output of) jobs/ reducers

Posted by Bryan Beaudreault <bb...@hubspot.com>.
Hey Adrian,

To clarify, the replication happens on *write*.  So as you write output
from the reducer of Job A, you are writing into hdfs.  Part of that write
path is replicating the data to 2 additional hosts in the cluster (local +
2, this is configured by dfs.replication configuration value).  So by the
time Job B starts, hadoop has 3 options where each mapper can run and be
data-local.  Hadoop will do all the work to try to make everything as local
as possible.

You'll be able to see from the counters on the job how successful hadoop
was at placing your mappers.  See the counters "Data-local map tasks" and
"Rack-local map tasks".  Rack-local being those where hadoop was not able
to place the mapper on the same host as the data, but was at least able to
keep it within the same rack.

All of this is dependent a proper topology configuration, both in your
NameNode and JobTracker.


On Thu, Sep 12, 2013 at 3:02 PM, Adrian CAPDEFIER <ch...@gmail.com>wrote:

> Thanks Bryan.
>
> Yes, I am using hadoop + hdfs.
>
> If I understand your point, hadoop tries to start the mapping processes on
> nodes where the data is local and if that's not possible, then it is hdfs
> that replicates the data to the mapper nodes?
>
> I expected to have to set up this in the code and I completely ignored
> HDFS; I guess it's a case of not seeing the forest from all the trees!
>
>
>
>  On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <
> bbeaudreault@hubspot.com> wrote:
>
>> It really comes down to the following:
>>
>> In Job A set mapred.output.dir to some directory X.
>> In Job B set mapred.input.dir to the same directory X.
>>
>> For Job A, do context.write() as normally, and each reducer will create
>> an output file in mapred.output.dir.  Then in Job B each of those will
>> correspond to a mapper.
>>
>> Of course you need to make sure your input and output formats, as well as
>> input and output keys/values, match up between the two jobs as well.
>>
>> If you are using HDFS, which it seems you are, the directories specified
>> can be HDFS directories.  In that case, with a replication factor of 3,
>> each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
>> the work to ensure that the mappers in the second job do as good a job as
>> possible to be data or rack-local.
>>
>>
>> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <
>> chivas314159@gmail.com> wrote:
>>
>>> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
>>> prefer to keep, if possible, everything as close to the hadoop libraries.
>>>
>>> I am sure I am overlooking something basic as repartitioning is a fairly
>>> common operation in MPP environments.
>>>
>>>
>>> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com>wrote:
>>>
>>>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>>>> think there are other (Spring integration maybe?) but I'm not familiar with
>>>> them enough to make a recommendation.
>>>>
>>>> Note that with Cascading and Pig you don't write 'map reduce' you write
>>>> logic and they map it to the various mapper/reducer steps automatically.
>>>>
>>>> Hope this helps,
>>>>
>>>> Chris
>>>>
>>>>
>>>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <
>>>> chivas314159@gmail.com> wrote:
>>>>
>>>>> Howdy,
>>>>>
>>>>> My application requires 2 distinct processing steps (reducers) to be
>>>>> performed on the input data. The first operation generates changes the key
>>>>> values and, records that had different keys in step 1 can end up having the
>>>>> same key in step 2.
>>>>>
>>>>> The heavy lifting of the operation is in step1 and step2 only combines
>>>>> records where keys were changed.
>>>>>
>>>>> In short the overview is:
>>>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>>>
>>>>>
>>>>> To implement this in hadoop, it seems that I need to create a separate
>>>>> job for each step.
>>>>>
>>>>> Now I assumed, there would some sort of job management under hadoop to
>>>>> link Job 1 and 2, but the only thing I could find was related to job
>>>>> scheduling and nothing on how to synchronize the input/output of the linked
>>>>> jobs.
>>>>>
>>>>>
>>>>>
>>>>> The only crude solution that I can think of is to use a temporary file
>>>>> under HDFS, but even so I'm not sure if this will work.
>>>>>
>>>>> The overview of the process would be:
>>>>> Sequential Input (lines) => Job A[Mapper (key1, value1) =>
>>>>> ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2,
>>>>> value2) => Reducer (key2, value 3)] => output.
>>>>>
>>>>> Is there a better way to pass the output from Job A as input to Job B
>>>>> (e.g. using network streams or some built in java classes that don't do
>>>>> disk i/o)?
>>>>>
>>>>>
>>>>>
>>>>> The temporary file solution will work in a single node configuration,
>>>>> but I'm not sure about an MPP config.
>>>>>
>>>>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3
>>>>> or both jobs run on all 4 nodes - will HDFS be able to redistribute
>>>>> automagically the records between nodes or does this need to be coded
>>>>> somehow?
>>>>>
>>>>
>>>>
>>>
>>
>

Re: chaining (the output of) jobs/ reducers

Posted by Venkata K Pisupat <kr...@gmail.com>.
Cascading would a good option in case you have a complex flow. However, in your case, you are trying to chain two jobs only. I would suggest you to follow these steps. 

1. The output directory of Job1 would be set at the input directory for Job2. 
2. Launch Job1 using the new API. In launcher program, instead of using Jobconf and JobClient for running job, use Job class. To run the job, invoke Job.waitForcompletion(true) on Job1. This ensures to block the program until Job1 is run completely. 
3. Optionally, you can combine the individual output files generated by each reducer (if you have more than 1 reducer task) into one or more files. 
4. Next step would be to launch Job2. 

The output of Job1 is written to HDFS and therefore, you will not have any issues while Job2 reads the input (Job1's output). 





On Sep 12, 2013, at 12:02 PM, Adrian CAPDEFIER <ch...@gmail.com> wrote:

> Thanks Bryan.
> 
> Yes, I am using hadoop + hdfs.
> 
> If I understand your point, hadoop tries to start the mapping processes on nodes where the data is local and if that's not possible, then it is hdfs that replicates the data to the mapper nodes? 
> 
> I expected to have to set up this in the code and I completely ignored HDFS; I guess it's a case of not seeing the forest from all the trees!
> 
> 
> On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <bb...@hubspot.com> wrote:
> It really comes down to the following:
> 
> In Job A set mapred.output.dir to some directory X.
> In Job B set mapred.input.dir to the same directory X.
> 
> For Job A, do context.write() as normally, and each reducer will create an output file in mapred.output.dir.  Then in Job B each of those will correspond to a mapper.
> 
> Of course you need to make sure your input and output formats, as well as input and output keys/values, match up between the two jobs as well.
> 
> If you are using HDFS, which it seems you are, the directories specified can be HDFS directories.  In that case, with a replication factor of 3, each of these output files will exist on 3 nodes.  Hadoop and HDFS will do the work to ensure that the mappers in the second job do as good a job as possible to be data or rack-local.
> 
> 
> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <ch...@gmail.com> wrote:
> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd prefer to keep, if possible, everything as close to the hadoop libraries.
> 
> I am sure I am overlooking something basic as repartitioning is a fairly common operation in MPP environments.
> 
> 
> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com> wrote:
> If you want to stay in Java look at Cascading. Pig is also helpful. I think there are other (Spring integration maybe?) but I'm not familiar with them enough to make a recommendation.
> 
> Note that with Cascading and Pig you don't write 'map reduce' you write logic and they map it to the various mapper/reducer steps automatically.
> 
> Hope this helps,
> 
> Chris
> 
> 
> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <ch...@gmail.com> wrote:
> Howdy,
> 
> My application requires 2 distinct processing steps (reducers) to be performed on the input data. The first operation generates changes the key values and, records that had different keys in step 1 can end up having the same key in step 2.
> 
> The heavy lifting of the operation is in step1 and step2 only combines records where keys were changed.
> 
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
> 
> 
> To implement this in hadoop, it seems that I need to create a separate job for each step. 
> 
> Now I assumed, there would some sort of job management under hadoop to link Job 1 and 2, but the only thing I could find was related to job scheduling and nothing on how to synchronize the input/output of the linked jobs.
> 
> 
> 
> The only crude solution that I can think of is to use a temporary file under HDFS, but even so I'm not sure if this will work.
> 
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer (key2, value 3)] => output.
> 
> Is there a better way to pass the output from Job A as input to Job B (e.g. using network streams or some built in java classes that don't do disk i/o)? 
> 
> 
> 
> The temporary file solution will work in a single node configuration, but I'm not sure about an MPP config.
> 
> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or both jobs run on all 4 nodes - will HDFS be able to redistribute automagically the records between nodes or does this need to be coded somehow? 
> 
> 
> 
> 


Re: chaining (the output of) jobs/ reducers

Posted by Adrian CAPDEFIER <ch...@gmail.com>.
Thanks Bryan.

Yes, I am using hadoop + hdfs.

If I understand your point, hadoop tries to start the mapping processes on
nodes where the data is local, and if that's not possible, it is hdfs that
replicates the data to the mapper nodes?

I expected to have to set this up in the code and I completely ignored
HDFS; I guess it's a case of not seeing the forest for the trees!


On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <bbeaudreault@hubspot.com
> wrote:

> It really comes down to the following:
>
> In Job A set mapred.output.dir to some directory X.
> In Job B set mapred.input.dir to the same directory X.
>
> For Job A, do context.write() as normally, and each reducer will create an
> output file in mapred.output.dir.  Then in Job B each of those will
> correspond to a mapper.
>
> Of course you need to make sure your input and output formats, as well as
> input and output keys/values, match up between the two jobs as well.
>
> If you are using HDFS, which it seems you are, the directories specified
> can be HDFS directories.  In that case, with a replication factor of 3,
> each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
> the work to ensure that the mappers in the second job do as good a job as
> possible to be data or rack-local.
>
>
> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <chivas314159@gmail.com
> > wrote:
>
>> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
>> prefer to keep, if possible, everything as close to the hadoop libraries.
>>
>> I am sure I am overlooking something basic as repartitioning is a fairly
>> common operation in MPP environments.
>>
>>
>> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com>wrote:
>>
>>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>>> think there are other (Spring integration maybe?) but I'm not familiar with
>>> them enough to make a recommendation.
>>>
>>> Note that with Cascading and Pig you don't write 'map reduce' you write
>>> logic and they map it to the various mapper/reducer steps automatically.
>>>
>>> Hope this helps,
>>>
>>> Chris
>>>
>>>
>>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <
>>> chivas314159@gmail.com> wrote:
>>>
>>>> Howdy,
>>>>
>>>> My application requires 2 distinct processing steps (reducers) to be
>>>> performed on the input data. The first operation generates changes the key
>>>> values and, records that had different keys in step 1 can end up having the
>>>> same key in step 2.
>>>>
>>>> The heavy lifting of the operation is in step1 and step2 only combines
>>>> records where keys were changed.
>>>>
>>>> In short the overview is:
>>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>>
>>>>
>>>> To implement this in hadoop, it seems that I need to create a separate
>>>> job for each step.
>>>>
>>>> Now I assumed, there would some sort of job management under hadoop to
>>>> link Job 1 and 2, but the only thing I could find was related to job
>>>> scheduling and nothing on how to synchronize the input/output of the linked
>>>> jobs.
>>>>
>>>>
>>>>
>>>> The only crude solution that I can think of is to use a temporary file
>>>> under HDFS, but even so I'm not sure if this will work.
>>>>
>>>> The overview of the process would be:
>>>> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
>>>> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
>>>> (key2, value 3)] => output.
>>>>
>>>> Is there a better way to pass the output from Job A as input to Job B
>>>> (e.g. using network streams or some built in java classes that don't do
>>>> disk i/o)?
>>>>
>>>>
>>>>
>>>> The temporary file solution will work in a single node configuration,
>>>> but I'm not sure about an MPP config.
>>>>
>>>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3
>>>> or both jobs run on all 4 nodes - will HDFS be able to redistribute
>>>> automagically the records between nodes or does this need to be coded
>>>> somehow?
>>>>
>>>
>>>
>>
>

Re: chaining (the output of) jobs/ reducers

Posted by Bryan Beaudreault <bb...@hubspot.com>.
It really comes down to the following:

In Job A set mapred.output.dir to some directory X.
In Job B set mapred.input.dir to the same directory X.

For Job A, call context.write() as you normally would, and each reducer will
create an output file in mapred.output.dir.  Then in Job B each of those files
will correspond to a mapper.

Of course you need to make sure your input and output formats, as well as
input and output keys/values, match up between the two jobs as well.

If you are using HDFS, which it seems you are, the directories specified
can be HDFS directories.  In that case, with a replication factor of 3,
each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
the work to ensure that the mappers in the second job are data-local, or at
least rack-local, wherever possible.
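
As a rough sketch of those two settings (the shared directory name is
hypothetical, and jobA/jobB stand for already-configured Job objects), the
usual route from the Java API is the FileOutputFormat/FileInputFormat helpers,
which write the properties into the configuration for you:

Path shared = new Path("/user/adrian/jobA-output");  // hypothetical directory

// Job A: reducers call context.write() as usual and the output lands here.
FileOutputFormat.setOutputPath(jobA, shared);  // backs mapred.output.dir
                                               // (exact property name varies by version)

// Job B: reads the same directory; roughly one map task per part file.
FileInputFormat.addInputPath(jobB, shared);    // backs mapred.input.dir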


On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER
<ch...@gmail.com>wrote:

> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
> prefer to keep, if possible, everything as close to the hadoop libraries.
>
> I am sure I am overlooking something basic as repartitioning is a fairly
> common operation in MPP environments.
>
>
> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com>wrote:
>
>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>> think there are other (Spring integration maybe?) but I'm not familiar with
>> them enough to make a recommendation.
>>
>> Note that with Cascading and Pig you don't write 'map reduce' you write
>> logic and they map it to the various mapper/reducer steps automatically.
>>
>> Hope this helps,
>>
>> Chris
>>
>>
>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <chivas314159@gmail.com
>> > wrote:
>>
>>> Howdy,
>>>
>>> My application requires 2 distinct processing steps (reducers) to be
>>> performed on the input data. The first operation generates changes the key
>>> values and, records that had different keys in step 1 can end up having the
>>> same key in step 2.
>>>
>>> The heavy lifting of the operation is in step1 and step2 only combines
>>> records where keys were changed.
>>>
>>> In short the overview is:
>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>
>>>
>>> To implement this in hadoop, it seems that I need to create a separate
>>> job for each step.
>>>
>>> Now I assumed, there would some sort of job management under hadoop to
>>> link Job 1 and 2, but the only thing I could find was related to job
>>> scheduling and nothing on how to synchronize the input/output of the linked
>>> jobs.
>>>
>>>
>>>
>>> The only crude solution that I can think of is to use a temporary file
>>> under HDFS, but even so I'm not sure if this will work.
>>>
>>> The overview of the process would be:
>>> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
>>> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
>>> (key2, value 3)] => output.
>>>
>>> Is there a better way to pass the output from Job A as input to Job B
>>> (e.g. using network streams or some built in java classes that don't do
>>> disk i/o)?
>>>
>>>
>>>
>>> The temporary file solution will work in a single node configuration,
>>> but I'm not sure about an MPP config.
>>>
>>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or
>>> both jobs run on all 4 nodes - will HDFS be able to redistribute
>>> automagically the records between nodes or does this need to be coded
>>> somehow?
>>>
>>
>>
>

Re: chaining (the output of) jobs/ reducers

Posted by Shahab Yunus <sh...@gmail.com>.
"The temporary file solution will work in a single node configuration, but
I'm not sure about an MPP config.

Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or
both jobs run on all 4 nodes - will HDFS be able to redistribute
automagically the records between nodes or does this need to be coded
somehow?"

Correct me if I misunderstood your problem, but there shouldn't be any
concern about this point. The output of Job 1 will be on HDFS, the same
file system from which your Job 2 will read its input. This file system
hides from you on which actual node in the cluster the data lives.
Your Job 2 should simply read the output of Job 1 from an HDFS path. Also,
you can make Job 2 dependent on the completion of Job 1; you can do that in
the driver code, so that Job 2 only runs once Job 1 has finished.
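
One way to express that dependency in the driver code, as a sketch only (not
necessarily what is meant above; job1 and job2 are assumed to be
already-configured org.apache.hadoop.mapreduce.Job objects, and exception
handling is omitted), is JobControl/ControlledJob from
org.apache.hadoop.mapreduce.lib.jobcontrol (or their older
org.apache.hadoop.mapred.jobcontrol counterparts):

ControlledJob cjob1 = new ControlledJob(job1, null);
ControlledJob cjob2 = new ControlledJob(job2, null);
cjob2.addDependingJob(cjob1);          // Job 2 stays queued until Job 1 succeeds

JobControl control = new JobControl("two-step chain");
control.addJob(cjob1);
control.addJob(cjob2);

Thread runner = new Thread(control);   // JobControl implements Runnable
runner.start();
while (!control.allFinished()) {
  Thread.sleep(500);
}
control.stop();

The simpler alternative is the one mentioned elsewhere in the thread: call
job1.waitForCompletion(true) in the driver and only submit job2 when it
returns true.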

Maybe I am missing something here, e.g. why are you using ChainReducer in
Job 1?

"Is there a better way to pass the output from Job A as input to Job B
(e.g. using network streams or some built in java classes that don't do
disk i/o)? "
Maybe Hadoop streaming? But then you would have to design your jobs around it,
which seems like an overhead to me. Maybe the experts can help.

Regards,
Shahab


On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER
<ch...@gmail.com>wrote:

> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
> prefer to keep, if possible, everything as close to the hadoop libraries.
>
> I am sure I am overlooking something basic as repartitioning is a fairly
> common operation in MPP environments.
>
>
> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com>wrote:
>
>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>> think there are other (Spring integration maybe?) but I'm not familiar with
>> them enough to make a recommendation.
>>
>> Note that with Cascading and Pig you don't write 'map reduce' you write
>> logic and they map it to the various mapper/reducer steps automatically.
>>
>> Hope this helps,
>>
>> Chris
>>
>>
>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <chivas314159@gmail.com
>> > wrote:
>>
>>> Howdy,
>>>
>>> My application requires 2 distinct processing steps (reducers) to be
>>> performed on the input data. The first operation generates changes the key
>>> values and, records that had different keys in step 1 can end up having the
>>> same key in step 2.
>>>
>>> The heavy lifting of the operation is in step1 and step2 only combines
>>> records where keys were changed.
>>>
>>> In short the overview is:
>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>
>>>
>>> To implement this in hadoop, it seems that I need to create a separate
>>> job for each step.
>>>
>>> Now I assumed, there would some sort of job management under hadoop to
>>> link Job 1 and 2, but the only thing I could find was related to job
>>> scheduling and nothing on how to synchronize the input/output of the linked
>>> jobs.
>>>
>>>
>>>
>>> The only crude solution that I can think of is to use a temporary file
>>> under HDFS, but even so I'm not sure if this will work.
>>>
>>> The overview of the process would be:
>>> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
>>> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
>>> (key2, value 3)] => output.
>>>
>>> Is there a better way to pass the output from Job A as input to Job B
>>> (e.g. using network streams or some built in java classes that don't do
>>> disk i/o)?
>>>
>>>
>>>
>>> The temporary file solution will work in a single node configuration,
>>> but I'm not sure about an MPP config.
>>>
>>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or
>>> both jobs run on all 4 nodes - will HDFS be able to redistribute
>>> automagically the records between nodes or does this need to be coded
>>> somehow?
>>>
>>
>>
>

Re: chaining (the output of) jobs/ reducers

Posted by Adrian CAPDEFIER <ch...@gmail.com>.
Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
prefer to keep everything as close to the hadoop libraries as possible.

I am sure I am overlooking something basic, as repartitioning is a fairly
common operation in MPP environments.


On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <cu...@gmail.com>wrote:

> If you want to stay in Java look at Cascading. Pig is also helpful. I
> think there are other (Spring integration maybe?) but I'm not familiar with
> them enough to make a recommendation.
>
> Note that with Cascading and Pig you don't write 'map reduce' you write
> logic and they map it to the various mapper/reducer steps automatically.
>
> Hope this helps,
>
> Chris
>
>
> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <ch...@gmail.com>wrote:
>
>> Howdy,
>>
>> My application requires 2 distinct processing steps (reducers) to be
>> performed on the input data. The first operation generates changes the key
>> values and, records that had different keys in step 1 can end up having the
>> same key in step 2.
>>
>> The heavy lifting of the operation is in step1 and step2 only combines
>> records where keys were changed.
>>
>> In short the overview is:
>> Sequential file -> Step 1 -> Step 2 -> Output.
>>
>>
>> To implement this in hadoop, it seems that I need to create a separate
>> job for each step.
>>
>> Now I assumed, there would some sort of job management under hadoop to
>> link Job 1 and 2, but the only thing I could find was related to job
>> scheduling and nothing on how to synchronize the input/output of the linked
>> jobs.
>>
>>
>>
>> The only crude solution that I can think of is to use a temporary file
>> under HDFS, but even so I'm not sure if this will work.
>>
>> The overview of the process would be:
>> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
>> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
>> (key2, value 3)] => output.
>>
>> Is there a better way to pass the output from Job A as input to Job B
>> (e.g. using network streams or some built in java classes that don't do
>> disk i/o)?
>>
>>
>>
>> The temporary file solution will work in a single node configuration, but
>> I'm not sure about an MPP config.
>>
>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or
>> both jobs run on all 4 nodes - will HDFS be able to redistribute
>> automagically the records between nodes or does this need to be coded
>> somehow?
>>
>
>

Re: chaining (the output of) jobs/ reducers

Posted by Chris Curtin <cu...@gmail.com>.
If you want to stay in Java, look at Cascading. Pig is also helpful. I think
there are others (Spring integration, maybe?), but I'm not familiar enough
with them to make a recommendation.

Note that with Cascading and Pig you don't write 'map reduce'; you write
logic and they map it to the various mapper/reducer steps automatically.
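
From memory (so treat the exact names as approximate), the word-count example
from the Cascading user guide looks roughly like this, assuming the Cascading
2.x package layout:

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    // source reads raw lines; sink writes word + count, tab separated
    Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
    Tap sink = new Hfs(new TextLine(new Fields("word", "count")), args[1],
        SinkMode.REPLACE);

    Pipe assembly = new Pipe("wordcount");
    // split each line into words -- this is "logic", not a hand-written mapper
    assembly = new Each(assembly, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    // group on the word and count; Cascading decides where the map/reduce
    // boundaries go when it plans the flow
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, WordCountFlow.class);

    Flow flow = new HadoopFlowConnector(properties)
        .connect(source, sink, assembly);
    flow.complete();
  }
}

In your case the token split would become your step 1 logic and the
GroupBy/Every pair would do the step 2 combining on the new key; Cascading
works out how many map/reduce jobs the plan actually needs.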

Hope this helps,

Chris


On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <ch...@gmail.com>wrote:

> Howdy,
>
> My application requires 2 distinct processing steps (reducers) to be
> performed on the input data. The first operation changes the key values,
> and records that had different keys in step 1 can end up having the same
> key in step 2.
>
> The heavy lifting of the operation is in step 1, and step 2 only combines
> records whose keys were changed.
>
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
>
>
> To implement this in hadoop, it seems that I need to create a separate job
> for each step.
>
> Now I assumed there would be some sort of job management under Hadoop to
> link Job 1 and 2, but the only thing I could find was related to job
> scheduling and nothing on how to synchronize the input/output of the
> linked jobs.
>
>
>
> The only crude solution that I can think of is to use a temporary file
> under HDFS, but even so I'm not sure if this will work.
>
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
> (key2, value 3)] => output.
>
> Is there a better way to pass the output from Job A as input to Job B
> (e.g. using network streams or some built in java classes that don't do
> disk i/o)?
>
>
>
> The temporary file solution will work in a single node configuration, but
> I'm not sure about an MPP config.
>
> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or
> both jobs run on all 4 nodes - will HDFS be able to redistribute
> automagically the records between nodes or does this need to be coded
> somehow?
>

Re: chaining (the output of) jobs/ reducers

Posted by Adrian CAPDEFIER <ch...@gmail.com>.
I've just seen your email, Vinod. This is the behaviour I'd expect, and it is
similar to other data integration tools; I will keep an eye out for it as a
long-term option.


On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli <vinodkv@apache.org
> wrote:

>
> Other than the short-term solutions that others have proposed, Apache Tez
> solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers
> and reducers, and your own custom processors - all without persisting the
> intermediate outputs to HDFS.
>
> It works on top of YARN, though the first release of Tez is yet to happen.
>
> You can learn about it more here: http://tez.incubator.apache.org/
>
> HTH,
> +Vinod
>
> On Sep 12, 2013, at 6:36 AM, Adrian CAPDEFIER wrote:
>
> Howdy,
>
> My application requires 2 distinct processing steps (reducers) to be
> performed on the input data. The first operation changes the key values,
> and records that had different keys in step 1 can end up having the same
> key in step 2.
>
> The heavy lifting of the operation is in step 1, and step 2 only combines
> records whose keys were changed.
>
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
>
>
> To implement this in hadoop, it seems that I need to create a separate job
> for each step.
>
> Now I assumed there would be some sort of job management under Hadoop to
> link Job 1 and 2, but the only thing I could find was related to job
> scheduling and nothing on how to synchronize the input/output of the
> linked jobs.
>
>
>
> The only crude solution that I can think of is to use a temporary file
> under HDFS, but even so I'm not sure if this will work.
>
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
> (key2, value 3)] => output.
>
> Is there a better way to pass the output from Job A as input to Job B
> (e.g. using network streams or some built in java classes that don't do
> disk i/o)?
>
>
>
> The temporary file solution will work in a single node configuration, but
> I'm not sure about an MPP config.
>
> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or
> both jobs run on all 4 nodes - will HDFS be able to redistribute
> automagically the records between nodes or does this need to be coded
> somehow?
>

Re: chaining (the output of) jobs/ reducers

Posted by Vinod Kumar Vavilapalli <vi...@apache.org>.
Other than the short-term solutions that others have proposed, Apache Tez solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers and reducers, and your own custom processors - all without persisting the intermediate outputs to HDFS.

It works on top of YARN, though the first release of Tez is yet to happen.

You can learn about it more here: http://tez.incubator.apache.org/

HTH,
+Vinod

On Sep 12, 2013, at 6:36 AM, Adrian CAPDEFIER wrote:

> Howdy,
> 
> My application requires 2 distinct processing steps (reducers) to be performed on the input data. The first operation changes the key values, and records that had different keys in step 1 can end up having the same key in step 2.
> 
> The heavy lifting of the operation is in step 1, and step 2 only combines records whose keys were changed.
> 
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
> 
> 
> To implement this in hadoop, it seems that I need to create a separate job for each step. 
> 
> Now I assumed there would be some sort of job management under Hadoop to link Job 1 and 2, but the only thing I could find was related to job scheduling and nothing on how to synchronize the input/output of the linked jobs.
> 
> 
> 
> The only crude solution that I can think of is to use a temporary file under HDFS, but even so I'm not sure if this will work.
> 
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer (key2, value 3)] => output.
> 
> Is there a better way to pass the output from Job A as input to Job B (e.g. using network streams or some built in java classes that don't do disk i/o)? 
> 
> 
> 
> The temporary file solution will work in a single node configuration, but I'm not sure about an MPP config.
> 
> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or both jobs run on all 4 nodes - will HDFS be able to redistribute automagically the records between nodes or does this need to be coded somehow? 


