Posted to user@hadoop.apache.org by Aji Janis <aj...@gmail.com> on 2013/03/04 14:29:38 UTC

Accumulo and Mapreduce

Hello,

 I have an MR job design with a flow like this: Mapper1 -> Mapper2 ->
Mapper3 -> Reducer1. Mapper1's input is an Accumulo table, M1's output goes
to M2, and so on. Finally, the Reducer writes its output back to Accumulo.

Questions:

1) Has anyone tried something like this before? Are there any workflow
control APIs (in or outside of Hadoop) that can help me set up a job like
this? Or am I limited to using Quartz?
2) If both M2 and M3 needed to write some data to the same two tables in
Accumulo, would that be possible? Are there any good Accumulo MapReduce
jobs you can point me to, or any blogs/pages I can use as a reference
(starting points/best practices)?

Thank you in advance for any suggestions!

Aji
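
For concreteness, below is a minimal sketch of how such a job is typically
wired up with Accumulo's MapReduce bindings. It is only a sketch: it assumes
Hadoop 2.x and the Accumulo 1.5-era org.apache.accumulo.core.client.mapreduce
API, and every class, user, instance, and table name in it is a placeholder.
Since AccumuloOutputFormat routes each Mutation to the table named by the
Text key, a single job can write to several tables, which bears on question
2 above.

    import java.io.IOException;

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class AccumuloMrSketch {

      // Placeholder mapper: reads Key/Value pairs scanned from the input
      // table and emits (table name, Mutation) pairs. AccumuloOutputFormat
      // writes each Mutation to the table named by the Text key.
      public static class MyMapper extends Mapper<Key, Value, Text, Mutation> {
        @Override
        protected void map(Key k, Value v, Context ctx)
            throws IOException, InterruptedException {
          Mutation m = new Mutation(k.getRow());
          m.put(new Text("cf"), new Text("cq"), v);
          ctx.write(new Text("output_table"), m);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "accumulo-mr-sketch");
        job.setJarByClass(AccumuloMrSketch.class);

        // Scan "input_table" as the job's input.
        job.setInputFormatClass(AccumuloInputFormat.class);
        AccumuloInputFormat.setZooKeeperInstance(job, "myInstance", "zkhost:2181");
        AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"));
        AccumuloInputFormat.setInputTableName(job, "input_table");
        AccumuloInputFormat.setScanAuthorizations(job, Authorizations.EMPTY);

        job.setMapperClass(MyMapper.class);
        job.setNumReduceTasks(0); // map-only sketch; a real chain adds Reducer1
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Mutation.class);

        // Write Mutations back to Accumulo; the Text key picks the table.
        job.setOutputFormatClass(AccumuloOutputFormat.class);
        AccumuloOutputFormat.setZooKeeperInstance(job, "myInstance", "zkhost:2181");
        AccumuloOutputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"));
        AccumuloOutputFormat.setCreateTables(job, true);
        AccumuloOutputFormat.setDefaultTableName(job, "output_table");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }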

Re: Accumulo and Mapreduce

Posted by Nick Dimiduk <nd...@gmail.com>.
As Ted said, my first choice would be Cascading. Second choice would be
ChainMapper. As you'll see in those search results [0], it's not consistently
available in the "modern" mapreduce API across Hadoop releases. If
you've already implemented this against the mapred API, go for
ChainReducer. If you used mapreduce and you've decided to rewrite it, I'd go
for Cascading.

-n

[0]:
https://www.google.com/search?q=hadoop+chainmapper&aq=f&oq=hadoop+chainmapper
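
For the mapred-API route, a minimal sketch of the ChainMapper/ChainReducer
wiring follows. The library IdentityMapper/IdentityReducer classes stand in
for real Mapper1/Mapper2/Mapper3 and Reducer1 stages, and plain text input
is used for brevity; the Accumulo input/output formats slot in the same way.
The key/value classes passed to each call declare that stage's input and
output types.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ChainSketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ChainSketch.class);
        conf.setJobName("mapper-chain-sketch");
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Every mapper added here runs inside the SAME map task, one feeding
        // the next, so there is no extra job or HDFS round trip between them.
        ChainMapper.addMapper(conf, IdentityMapper.class,
            LongWritable.class, Text.class, LongWritable.class, Text.class,
            true, new JobConf(false));
        ChainMapper.addMapper(conf, IdentityMapper.class,
            LongWritable.class, Text.class, LongWritable.class, Text.class,
            true, new JobConf(false));

        // setReducer attaches the single reducer; ChainReducer.addMapper can
        // append further map stages after the reduce if needed.
        ChainReducer.setReducer(conf, IdentityReducer.class,
            LongWritable.class, Text.class, LongWritable.class, Text.class,
            true, new JobConf(false));

        JobClient.runJob(conf);
      }
    }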

On Mon, Mar 4, 2013 at 2:03 PM, Aji Janis <aj...@gmail.com> wrote:

> I was considering based on earlier discussions using a JobController or
> ChainMapper to do this. But like a few of you mentioned Pig, Cascade or
> Oozie might be better. So what are the use cases for them? How do I decide
> which one works best for what?
>
> Thank you all for your feedback.
>
>
>
> On Mon, Mar 4, 2013 at 2:43 PM, Ted Dunning <td...@maprtech.com> wrote:
>
>> Chaining the jobs is a fantastically inefficient solution.  If you use
>> Pig or Cascading, the optimizer will glue all of your map functions into a
>> single mapper.  The result is something like:
>>
>>     (mapper1 -> mapper2 -> mapper3) => reducer
>>
>> Here the parentheses indicate that all of the map functions are executed
>> as a single function formed by composing mapper1, mapper2, and mapper3.
>>  Writing multiple jobs to do this forces *lots* of unnecessary traffic to
>> your persistent store and lots of unnecessary synchronization.
>>
>> You can do this optimization by hand, but using a higher level language
>> is often better for maintenance.
>>
>>
>> On Mon, Mar 4, 2013 at 1:52 PM, Russell Jurney <ru...@gmail.com>wrote:
>>
>>> You can chain MR jobs with Oozie, but would suggest using Cascading, Pig
>>> or Hive. You can do this in a couple lines of code, I suspect. Two map
>>> reduce jobs should not pose any kind of challenge with the right tools.
>>>
>>>
>>> On Monday, March 4, 2013, Sandy Ryza wrote:
>>>
>>>> Hi Aji,
>>>>
>>>> Oozie is a mature project for managing MapReduce workflows.
>>>> http://oozie.apache.org/
>>>>
>>>> -Sandy
>>>>
>>>>
>>>> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <ju...@gmail.com>wrote:
>>>>
>>>>> Aji,
>>>>>
>>>>> Why don't you just chain the jobs together?
>>>>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>>>>
>>>>> Justin
>>>>>
>>>>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aj...@gmail.com> wrote:
>>>>> > Russell thanks for the link.
>>>>> >
>>>>> > I am interested in finding a solution (if out there) where Mapper1
>>>>> outputs a
>>>>> > custom object and Mapper 2 can use that as input. One way to do this
>>>>> > obviously by writing to Accumulo, in my case. But, is there another
>>>>> solution
>>>>> > for this:
>>>>> >
>>>>> > List<MyObject> ----> Input to Job
>>>>> >
>>>>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>>>>> <MyObjectId,
>>>>> > MyObject>
>>>>> >
>>>>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>>>>> >
>>>>> >
>>>>> >
>>>>> > Ideas?
>>>>> >
>>>>> >
>>>>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <
>>>>> russell.jurney@gmail.com>
>>>>> > wrote:
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>>>> >>
>>>>> >> AccumuloStorage for Pig comes with Accumulo. Easiest way would be
>>>>> to try
>>>>> >> it.
>>>>> >>
>>>>> >> Russell Jurney http://datasyndrome.com
>>>>> >>
>>>>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aj...@gmail.com> wrote:
>>>>> >>
>>>>> >> Hello,
>>>>> >>
>>>>> >>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
>>>>> >> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's
>>>>> output goes
>>>>> >> to M2.. and so on. Finally the Reducer writes output to Accumulo.
>>>>> >>
>>>>> >> Questions:
>>>>> >>
>>>>> >> 1) Has any one tried something like this before? Are there any
>>>>> workflow
>>>>> >> control apis (in or outside of Hadoop) that can help me set up the
>>>>> job like
>>>>> >> this. Or am I limited to use Quartz for this?
>>>>> >> 2) If both M2 and M3 needed to write some data to two same tables in
>>>>> >> Accumulo, is it possible to do so? Are there any good accumulo
>>>>> mapreduce
>>>>> >> jobs you can point me to? blogs/pages that I can use for reference
>>>>> (starting
>>>>> >> point/best practices).
>>>>> >>
>>>>> >> Thank you in advance for any suggestions!
>>>>> >>
>>>>> >> Aji
>>>>> >>
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome
>>> .com
>>>
>>
>>
>

Re: Accumulo and Mapreduce

Posted by Aji Janis <aj...@gmail.com>.
Based on the earlier discussion, I was considering using JobControl or
ChainMapper to do this. But as a few of you mentioned, Pig, Cascading, or
Oozie might be better. So what are the use cases for each? How do I decide
which one works best for what?

Thank you all for your feedback.
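
As a point of comparison for the JobControl option, here is a minimal
sketch of chaining two already-configured Jobs with the new-API
org.apache.hadoop.mapreduce.lib.jobcontrol classes. It assumes job2 has
been set up elsewhere to read job1's output path; everything else is a
placeholder. Note that this runs two full MapReduce jobs, which is exactly
the overhead Ted describes below.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class JobControlSketch {

      // job1 and job2 are assumed to be fully configured Jobs, with job2's
      // input path pointing at job1's output path.
      public static void runChain(Job job1, Job job2) throws Exception {
        ControlledJob step1 = new ControlledJob(job1.getConfiguration());
        ControlledJob step2 = new ControlledJob(job2.getConfiguration());
        step2.addDependingJob(step1); // step2 starts only after step1 succeeds

        JobControl control = new JobControl("two-step-chain");
        control.addJob(step1);
        control.addJob(step2);

        // JobControl is a Runnable that polls job states, so run it on its
        // own thread and wait for the whole graph to finish.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
          Thread.sleep(1000);
        }
        control.stop();
      }
    }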



On Mon, Mar 4, 2013 at 2:43 PM, Ted Dunning <td...@maprtech.com> wrote:

> Chaining the jobs is a fantastically inefficient solution.  If you use Pig
> or Cascading, the optimizer will glue all of your map functions into a
> single mapper.  The result is something like:
>
>     (mapper1 -> mapper2 -> mapper3) => reducer
>
> Here the parentheses indicate that all of the map functions are executed
> as a single function formed by composing mapper1, mapper2, and mapper3.
>  Writing multiple jobs to do this forces *lots* of unnecessary traffic to
> your persistent store and lots of unnecessary synchronization.
>
> You can do this optimization by hand, but using a higher level language is
> often better for maintenance.
>
>
> On Mon, Mar 4, 2013 at 1:52 PM, Russell Jurney <ru...@gmail.com>wrote:
>
>> You can chain MR jobs with Oozie, but would suggest using Cascading, Pig
>> or Hive. You can do this in a couple lines of code, I suspect. Two map
>> reduce jobs should not pose any kind of challenge with the right tools.
>>
>>
>> On Monday, March 4, 2013, Sandy Ryza wrote:
>>
>>> Hi Aji,
>>>
>>> Oozie is a mature project for managing MapReduce workflows.
>>> http://oozie.apache.org/
>>>
>>> -Sandy
>>>
>>>
>>> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <ju...@gmail.com>wrote:
>>>
>>>> Aji,
>>>>
>>>> Why don't you just chain the jobs together?
>>>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>>>
>>>> Justin
>>>>
>>>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aj...@gmail.com> wrote:
>>>> > Russell thanks for the link.
>>>> >
>>>> > I am interested in finding a solution (if out there) where Mapper1
>>>> outputs a
>>>> > custom object and Mapper 2 can use that as input. One way to do this
>>>> > obviously by writing to Accumulo, in my case. But, is there another
>>>> solution
>>>> > for this:
>>>> >
>>>> > List<MyObject> ----> Input to Job
>>>> >
>>>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>>>> <MyObjectId,
>>>> > MyObject>
>>>> >
>>>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>>>> >
>>>> >
>>>> >
>>>> > Ideas?
>>>> >
>>>> >
>>>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <
>>>> russell.jurney@gmail.com>
>>>> > wrote:
>>>> >>
>>>> >>
>>>> >>
>>>> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>>> >>
>>>> >> AccumuloStorage for Pig comes with Accumulo. Easiest way would be to
>>>> try
>>>> >> it.
>>>> >>
>>>> >> Russell Jurney http://datasyndrome.com
>>>> >>
>>>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aj...@gmail.com> wrote:
>>>> >>
>>>> >> Hello,
>>>> >>
>>>> >>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
>>>> >> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's
>>>> output goes
>>>> >> to M2.. and so on. Finally the Reducer writes output to Accumulo.
>>>> >>
>>>> >> Questions:
>>>> >>
>>>> >> 1) Has any one tried something like this before? Are there any
>>>> workflow
>>>> >> control apis (in or outside of Hadoop) that can help me set up the
>>>> job like
>>>> >> this. Or am I limited to use Quartz for this?
>>>> >> 2) If both M2 and M3 needed to write some data to two same tables in
>>>> >> Accumulo, is it possible to do so? Are there any good accumulo
>>>> mapreduce
>>>> >> jobs you can point me to? blogs/pages that I can use for reference
>>>> (starting
>>>> >> point/best practices).
>>>> >>
>>>> >> Thank you in advance for any suggestions!
>>>> >>
>>>> >> Aji
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.
>> com
>>
>
>

Re: Accumulo and Mapreduce

Posted by Ted Dunning <td...@maprtech.com>.
Chaining the jobs is a fantastically inefficient solution.  If you use Pig
or Cascading, the optimizer will glue all of your map functions into a
single mapper.  The result is something like:

    (mapper1 -> mapper2 -> mapper3) => reducer

Here the parentheses indicate that all of the map functions are executed as
a single function formed by composing mapper1, mapper2, and mapper3.
Writing multiple jobs to do this forces *lots* of unnecessary traffic to
your persistent store and lots of unnecessary synchronization.

You can do this optimization by hand, but using a higher-level language is
often better for maintenance.
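
To make the by-hand version concrete, here is a minimal sketch of one
Mapper whose map() applies the three map functions in sequence, roughly the
shape Pig or Cascading would generate. The key/value types and the three
transform methods are hypothetical stand-ins.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // The whole chain runs inside a single map task, so nothing is written
    // to the persistent store (or shuffled) between the stages.
    public class ComposedMapper extends Mapper<Text, Text, Text, Text> {

      @Override
      protected void map(Text key, Text value, Context ctx)
          throws IOException, InterruptedException {
        Text v1 = mapper1(value); // what Mapper1 would have emitted
        Text v2 = mapper2(v1);    // what Mapper2 would have emitted
        Text v3 = mapper3(v2);    // what Mapper3 would have emitted
        ctx.write(key, v3);       // only the composed result reaches the reducer
      }

      // Placeholder transforms; real ones would do the per-stage work.
      private Text mapper1(Text v) { return v; }
      private Text mapper2(Text v) { return v; }
      private Text mapper3(Text v) { return v; }
    }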


On Mon, Mar 4, 2013 at 1:52 PM, Russell Jurney <ru...@gmail.com>wrote:

> You can chain MR jobs with Oozie, but would suggest using Cascading, Pig
> or Hive. You can do this in a couple lines of code, I suspect. Two map
> reduce jobs should not pose any kind of challenge with the right tools.
>
>
> On Monday, March 4, 2013, Sandy Ryza wrote:
>
>> Hi Aji,
>>
>> Oozie is a mature project for managing MapReduce workflows.
>> http://oozie.apache.org/
>>
>> -Sandy
>>
>>
>> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <ju...@gmail.com>wrote:
>>
>>> Aji,
>>>
>>> Why don't you just chain the jobs together?
>>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>>
>>> Justin
>>>
>>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aj...@gmail.com> wrote:
>>> > Russell thanks for the link.
>>> >
>>> > I am interested in finding a solution (if out there) where Mapper1
>>> outputs a
>>> > custom object and Mapper 2 can use that as input. One way to do this
>>> > obviously by writing to Accumulo, in my case. But, is there another
>>> solution
>>> > for this:
>>> >
>>> > List<MyObject> ----> Input to Job
>>> >
>>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>>> <MyObjectId,
>>> > MyObject>
>>> >
>>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>>> >
>>> >
>>> >
>>> > Ideas?
>>> >
>>> >
>>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <
>>> russell.jurney@gmail.com>
>>> > wrote:
>>> >>
>>> >>
>>> >>
>>> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>> >>
>>> >> AccumuloStorage for Pig comes with Accumulo. Easiest way would be to
>>> try
>>> >> it.
>>> >>
>>> >> Russell Jurney http://datasyndrome.com
>>> >>
>>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aj...@gmail.com> wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
>>> >> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's
>>> output goes
>>> >> to M2.. and so on. Finally the Reducer writes output to Accumulo.
>>> >>
>>> >> Questions:
>>> >>
>>> >> 1) Has any one tried something like this before? Are there any
>>> workflow
>>> >> control apis (in or outside of Hadoop) that can help me set up the
>>> job like
>>> >> this. Or am I limited to use Quartz for this?
>>> >> 2) If both M2 and M3 needed to write some data to two same tables in
>>> >> Accumulo, is it possible to do so? Are there any good accumulo
>>> mapreduce
>>> >> jobs you can point me to? blogs/pages that I can use for reference
>>> (starting
>>> >> point/best practices).
>>> >>
>>> >> Thank you in advance for any suggestions!
>>> >>
>>> >> Aji
>>> >>
>>> >
>>>
>>
>>
>
> --
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.
> com
>

Re: Accumulo and Mapreduce

Posted by Russell Jurney <ru...@gmail.com>.
You can chain MR jobs with Oozie, but I would suggest using Cascading, Pig,
or Hive. You can do this in a couple lines of code, I suspect. Two MapReduce
jobs should not pose any kind of challenge with the right tools.
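
For a sense of what the Cascading version looks like, here is a rough
sketch against the Cascading 2.x API. Treat it as approximate: the Identity
stages stand in for real map functions, and the paths and field names are
placeholders. The three Each stages before the GroupBy get planned into a
single mapper, which is exactly the fusion Ted describes above.

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.Identity;
    import cascading.operation.aggregator.Count;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class CascadingSketch {
      public static void main(String[] args) {
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

        // Three map-side stages followed by one reduce-side stage. The
        // planner fuses the consecutive Each stages into one mapper; the
        // GroupBy is the only shuffle boundary.
        Pipe pipe = new Pipe("chain");
        pipe = new Each(pipe, new Identity());    // stand-in for Mapper1
        pipe = new Each(pipe, new Identity());    // stand-in for Mapper2
        pipe = new Each(pipe, new Identity());    // stand-in for Mapper3
        pipe = new GroupBy(pipe, new Fields("line"));
        pipe = new Every(pipe, new Count());      // stand-in for Reducer1

        FlowDef flowDef = FlowDef.flowDef()
            .setName("fused-chain")
            .addSource(pipe, source)
            .addTailSink(pipe, sink);

        new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
      }
    }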

On Monday, March 4, 2013, Sandy Ryza wrote:

> Hi Aji,
>
> Oozie is a mature project for managing MapReduce workflows.
> http://oozie.apache.org/
>
> -Sandy
>
>
> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <justin.woody@gmail.com>
> wrote:
>
>> Aji,
>>
>> Why don't you just chain the jobs together?
>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>
>> Justin
>>
>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aji1705@gmail.com>
>> wrote:
>> > Russell thanks for the link.
>> >
>> > I am interested in finding a solution (if out there) where Mapper1
>> outputs a
>> > custom object and Mapper 2 can use that as input. One way to do this
>> > obviously by writing to Accumulo, in my case. But, is there another
>> solution
>> > for this:
>> >
>> > List<MyObject> ----> Input to Job
>> >
>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>> <MyObjectId,
>> > MyObject>
>> >
>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>> >
>> >
>> >
>> > Ideas?
>> >
>> >
>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <
>> russell.jurney@gmail.com>
>> > wrote:
>> >>
>> >>
>> >>
>> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>> >>
>> >> AccumuloStorage for Pig comes with Accumulo. Easiest way would be to
>> try
>> >> it.
>> >>
>> >> Russell Jurney http://datasyndrome.com
>> >>
>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aji1705@gmail.com>
>> wrote:
>> >>
>> >> Hello,
>> >>
>> >>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
>> >> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's output
>> goes
>> >> to M2.. and so on. Finally the Reducer writes output to Accumulo.
>> >>
>> >> Questions:
>> >>
>> >> 1) Has any one tried something like this before? Are there any workflow
>> >> control apis (in or outside of Hadoop) that can help me set up the job
>> like
>> >> this. Or am I limited to use Quartz for this?
>> >> 2) If both M2 and M3 needed to write some data to two same tables in
>> >> Accumulo, is it possible to do so? Are there any good accumulo
>> mapreduce
>> >> jobs you can point me to? blogs/pages that I can use for reference
>> (starting
>> >> point/best practices).
>> >>
>> >> Thank you in advance for any suggestions!
>> >>
>> >> Aji
>> >>
>> >
>>
>
>

-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Accumulo and Mapreduce

Posted by Russell Jurney <ru...@gmail.com>.
You can chain MR jobs with Oozie, but would suggest using Cascading, Pig or
Hive. You can do this is a couple lines of code, I suspect. Two map reduce
jobs should not pose any kind of challenge with the right tools.

On Monday, March 4, 2013, Sandy Ryza wrote:

> Hi Aji,
>
> Oozie is a mature project for managing MapReduce workflows.
> http://oozie.apache.org/
>
> -Sandy
>
>
> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <justin.woody@gmail.com<javascript:_e({}, 'cvml', 'justin.woody@gmail.com');>
> > wrote:
>
>> Aji,
>>
>> Why don't you just chain the jobs together?
>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>
>> Justin
>>
>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aji1705@gmail.com<javascript:_e({}, 'cvml', 'aji1705@gmail.com');>>
>> wrote:
>> > Russell thanks for the link.
>> >
>> > I am interested in finding a solution (if out there) where Mapper1
>> outputs a
>> > custom object and Mapper 2 can use that as input. One way to do this
>> > obviously by writing to Accumulo, in my case. But, is there another
>> solution
>> > for this:
>> >
>> > List<MyObject> ----> Input to Job
>> >
>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>> <MyObjectId,
>> > MyObject>
>> >
>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>> >
>> >
>> >
>> > Ideas?
>> >
>> >
>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <
>> russell.jurney@gmail.com <javascript:_e({}, 'cvml',
>> 'russell.jurney@gmail.com');>>
>> > wrote:
>> >>
>> >>
>> >>
>> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>> >>
>> >> AccumuloStorage for Pig comes with Accumulo. Easiest way would be to
>> try
>> >> it.
>> >>
>> >> Russell Jurney http://datasyndrome.com
>> >>
>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aji1705@gmail.com<javascript:_e({}, 'cvml', 'aji1705@gmail.com');>>
>> wrote:
>> >>
>> >> Hello,
>> >>
>> >>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
>> >> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's output
>> goes
>> >> to M2.. and so on. Finally the Reducer writes output to Accumulo.
>> >>
>> >> Questions:
>> >>
>> >> 1) Has any one tried something like this before? Are there any workflow
>> >> control apis (in or outside of Hadoop) that can help me set up the job
>> like
>> >> this. Or am I limited to use Quartz for this?
>> >> 2) If both M2 and M3 needed to write some data to two same tables in
>> >> Accumulo, is it possible to do so? Are there any good accumulo
>> mapreduce
>> >> jobs you can point me to? blogs/pages that I can use for reference
>> (starting
>> >> point/best practices).
>> >>
>> >> Thank you in advance for any suggestions!
>> >>
>> >> Aji
>> >>
>> >
>>
>
>

-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Accumulo and Mapreduce

Posted by Russell Jurney <ru...@gmail.com>.
You can chain MR jobs with Oozie, but would suggest using Cascading, Pig or
Hive. You can do this is a couple lines of code, I suspect. Two map reduce
jobs should not pose any kind of challenge with the right tools.

On Monday, March 4, 2013, Sandy Ryza wrote:

> Hi Aji,
>
> Oozie is a mature project for managing MapReduce workflows.
> http://oozie.apache.org/
>
> -Sandy
>
>
> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <justin.woody@gmail.com<javascript:_e({}, 'cvml', 'justin.woody@gmail.com');>
> > wrote:
>
>> Aji,
>>
>> Why don't you just chain the jobs together?
>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>
>> Justin
>>
>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aji1705@gmail.com<javascript:_e({}, 'cvml', 'aji1705@gmail.com');>>
>> wrote:
>> > Russell thanks for the link.
>> >
>> > I am interested in finding a solution (if out there) where Mapper1
>> outputs a
>> > custom object and Mapper 2 can use that as input. One way to do this
>> > obviously by writing to Accumulo, in my case. But, is there another
>> solution
>> > for this:
>> >
>> > List<MyObject> ----> Input to Job
>> >
>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>> <MyObjectId,
>> > MyObject>
>> >
>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>> >
>> >
>> >
>> > Ideas?
>> >
>> >
>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <
>> russell.jurney@gmail.com <javascript:_e({}, 'cvml',
>> 'russell.jurney@gmail.com');>>
>> > wrote:
>> >>
>> >>
>> >>
>> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>> >>
>> >> AccumuloStorage for Pig comes with Accumulo. Easiest way would be to
>> try
>> >> it.
>> >>
>> >> Russell Jurney http://datasyndrome.com
>> >>
>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aji1705@gmail.com<javascript:_e({}, 'cvml', 'aji1705@gmail.com');>>
>> wrote:
>> >>
>> >> Hello,
>> >>
>> >>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
>> >> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's output
>> goes
>> >> to M2.. and so on. Finally the Reducer writes output to Accumulo.
>> >>
>> >> Questions:
>> >>
>> >> 1) Has any one tried something like this before? Are there any workflow
>> >> control apis (in or outside of Hadoop) that can help me set up the job
>> like
>> >> this. Or am I limited to use Quartz for this?
>> >> 2) If both M2 and M3 needed to write some data to two same tables in
>> >> Accumulo, is it possible to do so? Are there any good accumulo
>> mapreduce
>> >> jobs you can point me to? blogs/pages that I can use for reference
>> (starting
>> >> point/best practices).
>> >>
>> >> Thank you in advance for any suggestions!
>> >>
>> >> Aji
>> >>
>> >
>>
>
>

-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Accumulo and Mapreduce

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Aji,

Oozie is a mature project for managing MapReduce workflows.
http://oozie.apache.org/
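
As a sketch of what driving it from code can look like, Oozie ships a Java
client API. Everything here is a placeholder (server URL, HDFS path), and it
assumes a workflow.xml describing the chained MR actions is already deployed
to HDFS:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class SubmitWorkflow {
      public static void main(String[] args) throws Exception {
        // Oozie server URL is a placeholder.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties; the workflow app (a workflow.xml defining the
        // chained MR actions) is assumed to live at this HDFS path already.
        Properties props = oozie.createConfiguration();
        props.setProperty(OozieClient.APP_PATH,
            "hdfs://namenode:8020/user/aji/accumulo-wf");

        // Submit and start the workflow, then report its initial status.
        String jobId = oozie.run(props);
        System.out.println("Started " + jobId + ": "
            + oozie.getJobInfo(jobId).getStatus());
      }
    }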

-Sandy


On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <ju...@gmail.com> wrote:

> Aji,
>
> Why don't you just chain the jobs together?
> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>
> Justin
>
> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aj...@gmail.com> wrote:
> > Russell thanks for the link.
> >
> > I am interested in finding a solution (if out there) where Mapper1
> outputs a
> > custom object and Mapper 2 can use that as input. One way to do this
> > obviously by writing to Accumulo, in my case. But, is there another
> solution
> > for this:
> >
> > List<MyObject> ----> Input to Job
> >
> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
> <MyObjectId,
> > MyObject>
> >
> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
> >
> >
> >
> > Ideas?
> >
> >
> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <
> russell.jurney@gmail.com>
> > wrote:
> >>
> >>
> >>
> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
> >>
> >> AccumuloStorage for Pig comes with Accumulo. Easiest way would be to try
> >> it.
> >>
> >> Russell Jurney http://datasyndrome.com
> >>
> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aj...@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
> >> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's output
> goes
> >> to M2.. and so on. Finally the Reducer writes output to Accumulo.
> >>
> >> Questions:
> >>
> >> 1) Has any one tried something like this before? Are there any workflow
> >> control apis (in or outside of Hadoop) that can help me set up the job
> like
> >> this. Or am I limited to use Quartz for this?
> >> 2) If both M2 and M3 needed to write some data to two same tables in
> >> Accumulo, is it possible to do so? Are there any good accumulo mapreduce
> >> jobs you can point me to? blogs/pages that I can use for reference
> (starting
> >> point/best practices).
> >>
> >> Thank you in advance for any suggestions!
> >>
> >> Aji
> >>
> >
>

Re: Accumulo and Mapreduce

Posted by Justin Woody <ju...@gmail.com>.
Aji,

Why don't you just chain the jobs together?
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
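
A minimal sketch of the ChainMapper approach that tutorial covers, against
the old mapred API (where ChainMapper lives on those releases). Mapper1,
Mapper2, Mapper3, and Reducer1 stand in for real old-API implementations,
and the Text hand-off types between stages are assumptions:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;

    public class ChainedJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ChainedJob.class);
        conf.setJobName("m1-m2-m3-r1");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Three map stages run back to back inside a single map task.
        ChainMapper.addMapper(conf, Mapper1.class, LongWritable.class,
            Text.class, Text.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(conf, Mapper2.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(conf, Mapper3.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));
        // A single reducer closes out the chain.
        ChainReducer.setReducer(conf, Reducer1.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));

        JobClient.runJob(conf);
      }
    }

Records flow from one chained mapper to the next in memory within the same
task; only the map-to-reduce boundary goes through the usual shuffle.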

Justin

On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aj...@gmail.com> wrote:
> Russell thanks for the link.
>
> I am interested in finding a solution (if out there) where Mapper1 outputs a
> custom object and Mapper 2 can use that as input. One way to do this
> obviously by writing to Accumulo, in my case. But, is there another solution
> for this:
>
> List<MyObject> ----> Input to Job
>
> MyObject ---> Input to Mapper1 (process MyObject) ----> Output <MyObjectId,
> MyObject>
>
> <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>
>
>
> Ideas?
>
>
> On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <ru...@gmail.com>
> wrote:
>>
>>
>> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>
>> AccumuloStorage for Pig comes with Accumulo. Easiest way would be to try
>> it.
>>
>> Russell Jurney http://datasyndrome.com
>>
>> On Mar 4, 2013, at 5:30 AM, Aji Janis <aj...@gmail.com> wrote:
>>
>> Hello,
>>
>>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
>> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's output goes
>> to M2.. and so on. Finally the Reducer writes output to Accumulo.
>>
>> Questions:
>>
>> 1) Has any one tried something like this before? Are there any workflow
>> control apis (in or outside of Hadoop) that can help me set up the job like
>> this. Or am I limited to use Quartz for this?
>> 2) If both M2 and M3 needed to write some data to two same tables in
>> Accumulo, is it possible to do so? Are there any good accumulo mapreduce
>> jobs you can point me to? blogs/pages that I can use for reference (starting
>> point/best practices).
>>
>> Thank you in advance for any suggestions!
>>
>> Aji
>>
>

Re: Accumulo and Mapreduce

Posted by Aji Janis <aj...@gmail.com>.
Russell thanks for the link.

I am interested in finding a solution (if one is out there) where Mapper1
outputs a custom object and Mapper2 can use that as input. One way to do
this, obviously, is by writing to Accumulo, in my case. But is there
another solution for this:

List<MyObject> ----> Input to Job

MyObject ---> Input to Mapper1 (process MyObject) ----> Output <MyObjectId,
MyObject>

<MyObjectId, MyObject> are Input to Mapper2 ... and so on



Ideas?
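
Whatever chaining mechanism carries it, MyObject has to be serializable to
Hadoop for any hand-off the framework mediates; one route is to implement
Writable. A minimal sketch, with invented fields since MyObject's real shape
isn't shown in the thread:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Hypothetical MyObject with two illustrative fields; once it
    // implements Writable it can ride in the value slot between stages.
    public class MyObject implements Writable {
      private Text name = new Text();
      private IntWritable score = new IntWritable();

      @Override
      public void write(DataOutput out) throws IOException {
        name.write(out);   // serialize fields in a fixed order
        score.write(out);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        name.readFields(in);   // deserialize in the same order
        score.readFields(in);
      }
    }

If MyObjectId travels in the key position, it would need to implement
WritableComparable rather than plain Writable, since map output keys have
to be sortable.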


On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <ru...@gmail.com> wrote:

>
> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>
> AccumuloStorage for Pig comes with Accumulo. Easiest way would be to try
> it.
>
> Russell Jurney http://datasyndrome.com
>
> On Mar 4, 2013, at 5:30 AM, Aji Janis <aj...@gmail.com> wrote:
>
> Hello,
>
>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's output goes
> to M2.. and so on. Finally the Reducer writes output to Accumulo.
>
> Questions:
>
> 1) Has any one tried something like this before? Are there any workflow
> control apis (in or outside of Hadoop) that can help me set up the job like
> this. Or am I limited to use Quartz for this?
> 2) If both M2 and M3 needed to write some data to two same tables in
> Accumulo, is it possible to do so? Are there any good accumulo mapreduce
> jobs you can point me to? blogs/pages that I can use for reference
> (starting point/best practices).
>
> Thank you in advance for any suggestions!
>
> Aji
>
>

Re: Accumulo and Mapreduce

Posted by Russell Jurney <ru...@gmail.com>.
http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java

AccumuloStorage for Pig comes with Accumulo. Easiest way would be to try it.
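
For plain MapReduce rather than Pig, the equivalent wiring is
AccumuloInputFormat/AccumuloOutputFormat. A sketch against the 1.4-era
static helpers; these method names moved around between Accumulo releases,
so treat the exact calls as assumptions to check against your version, and
the instance, ZooKeeper, user, and table names are all placeholders:

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class AccumuloJobSetup {
      public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "accumulo-mr");
        job.setInputFormatClass(AccumuloInputFormat.class);
        job.setOutputFormatClass(AccumuloOutputFormat.class);

        // Read from one table...
        AccumuloInputFormat.setZooKeeperInstance(job.getConfiguration(),
            "instance", "zk1:2181,zk2:2181");
        AccumuloInputFormat.setInputInfo(job.getConfiguration(),
            "user", "pass".getBytes(), "source_table", new Authorizations());

        // ...and write back out, falling back to a default table when the
        // emitted table name is empty.
        AccumuloOutputFormat.setZooKeeperInstance(job.getConfiguration(),
            "instance", "zk1:2181,zk2:2181");
        AccumuloOutputFormat.setOutputInfo(job.getConfiguration(),
            "user", "pass".getBytes(), false, "default_table");
        return job;
      }
    }

Since AccumuloOutputFormat takes a Text table name as the output key and a
Mutation as the value, one job can write to several tables simply by
emitting different table names, which bears on question 2 of the original
post.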

Russell Jurney http://datasyndrome.com

On Mar 4, 2013, at 5:30 AM, Aji Janis <aj...@gmail.com> wrote:

Hello,

 I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's output goes
to M2.. and so on. Finally the Reducer writes output to Accumulo.

Questions:

1) Has any one tried something like this before? Are there any workflow
control apis (in or outside of Hadoop) that can help me set up the job like
this. Or am I limited to use Quartz for this?
2) If both M2 and M3 needed to write some data to two same tables in
Accumulo, is it possible to do so? Are there any good accumulo mapreduce
jobs you can point me to? blogs/pages that I can use for reference
(starting point/best practices).

Thank you in advance for any suggestions!

Aji
