Posted to user@hadoop.apache.org by Paul Houle <on...@gmail.com> on 2013/09/13 21:23:59 UTC

Real Multiple Outputs for Hadoop -- is this implementation correct?

Hey guys, I spent some time last week thinking about Hadoop and then wrote
my own class, RealMultipleOutputs, that does something like what
MultipleOutputs does, except that you can specify different HDFS paths for
the different output streams. My pals were telling me to use Cascading or
Pig if I want this functionality, but otherwise I was happy writing plain
M/R jars.

I wrote up the implementation here:

https://github.com/paulhoule/infovore/wiki/Real-Multiple-Outputs-in-Hadoop

This works hand in hand with an abstraction layer that supports unit
testing with Mockito:

https://github.com/paulhoule/infovore/wiki/Unit-Testing-Hadoop-Mappers-and-Reducers

Anyway, I'd appreciate anybody looking at this code and trying to poke
holes in it. It runs OK on my tiny dev cluster on 1.0.4 and 1.1.2, and on
Amazon EMR, but I am wondering whether I missed something.

Re: Real Multiple Outputs for Hadoop -- is this implementation correct?

Posted by Harsh J <ha...@cloudera.com>.
I took a very brief look, and the approach of using multiple
OutputCommitters (OCs), one per unique parent path from a task, seems the
right thing to do. Nice work! Do consider contributing this if it's working
well for you :)
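
To make the "one committer per unique parent path" idea concrete, here is a
minimal sketch in plain Java. It does not use Hadoop's classes and is not the
RealMultipleOutputs code from the wiki: the Committer interface,
RecordingCommitter, and the /out/... paths are hypothetical stand-ins invented
for illustration. It only shows the bookkeeping, assuming a task keeps a map
from parent directory to committer and commits each one once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerPathCommitters {
    /** Hypothetical stand-in for an output committer; NOT Hadoop's real API. */
    interface Committer {
        void commitTask();
    }

    /** Illustrative committer that only records which directory it committed. */
    static final class RecordingCommitter implements Committer {
        private final String outputDir;
        private final List<String> log;

        RecordingCommitter(String outputDir, List<String> log) {
            this.outputDir = outputDir;
            this.log = log;
        }

        @Override
        public void commitTask() {
            log.add("commit " + outputDir);
        }
    }

    // One committer per unique parent directory, in first-seen order.
    private final Map<String, Committer> committers = new LinkedHashMap<>();
    private final List<String> log = new ArrayList<>();

    /** Parent directory of a path; paths with no parent map to "/" (a simplification). */
    static String parentOf(String path) {
        int slash = path.lastIndexOf('/');
        return slash <= 0 ? "/" : path.substring(0, slash);
    }

    /** Returns the committer for the file's parent directory, creating it on first use. */
    Committer committerFor(String outputFile) {
        return committers.computeIfAbsent(
                parentOf(outputFile), dir -> new RecordingCommitter(dir, log));
    }

    /** Commits every distinct output directory exactly once. */
    void commitAll() {
        for (Committer c : committers.values()) {
            c.commitTask();
        }
    }

    List<String> log() {
        return log;
    }

    public static void main(String[] args) {
        PerPathCommitters task = new PerPathCommitters();
        task.committerFor("/out/accepted/part-r-00000");
        task.committerFor("/out/rejected/part-r-00000");
        task.committerFor("/out/accepted/part-r-00001"); // reuses the first committer
        task.commitAll();
        System.out.println(task.log());
    }
}
```

Three output files under two parent directories yield two committers, so the
task commits two directories rather than three files. In a real job each
committer would be whatever the chosen OutputFormat supplies for that path.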

On Sat, Sep 14, 2013 at 12:53 AM, Paul Houle <on...@gmail.com> wrote:
> Hey guys I spent some time last week thinking about Hadoop before I wrote my
> own class,  RealMultipleOutputs,  that does something like what
> MultipleOutputs does,  except that you can specify different hdfs paths for
> the different output streams.   My pals were telling me to use Cascading or
> Pig if I want this functionality,  but otherwise I was happy writing Plain
> M/R jars
>
> I wrote up the implementation here:
>
> https://github.com/paulhoule/infovore/wiki/Real-Multiple-Outputs-in-Hadoop
>
> And this works hand-in hand with an abstraction layer that supports unit
> testing w/ Mockito
>
> https://github.com/paulhoule/infovore/wiki/Unit-Testing-Hadoop-Mappers-and-Reducers
>
> Anyway,  I'd appreciate anybody looking at this code and trying to poke
> holes in it.  It runs OK on my tiny dev cluster in 1.0.4,  1.1.2 and in AMZN
> EMR but I am wondering if I missed something.
>
>



-- 
Harsh J
