Posted to common-user@hadoop.apache.org by Aji Janis <aj...@gmail.com> on 2013/03/01 20:48:25 UTC

Java jars and MapReduce

Hello,

Current design: I have a Java object, MyObjectA. MyObjectA goes through
three processors (jars) that run in sequence and do a lot of processing to
enrich A with a great deal of additional data (think ETL); the final result
is MyObjectD (note: MyObjectD is really A with more fields added to it, but
I want to be clear that the two end up quite different). When MyObjectD is
ready, it is saved to my non-relational database (Accumulo). Currently, all
of this is driven by Quartz Scheduler: a List<MyObjectA> is submitted for
processing every N minutes. Everything is written in Java, and there is a
lot of back-and-forth with Accumulo (to access tables that help convert A
to D).

We split the processing into three processors only because it was more
convenient and we didn't want everything rolled up in one processor. That
said, I can definitely merge the three into ONE processor. My question is:
generally speaking, what do I need to be concerned about, or look into, to
turn this into a MapReduce job? I am asking for pointers on where to even
start.

Let's say all my processing is done in mappers, so my input is MyObjectA
and my output is MyObjectD from each mapper, and then my reducer simply
writes all the MyObjectD objects to Accumulo. Is achieving this as easy as
just submitting the jar to Hadoop?
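
To make the question concrete, here is a rough sketch of what I am
imagining. The class names, the line format, and the "enrich" step are all
placeholders, not code we actually have:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class EnrichJob {

      // Assumes each input line holds one serialized MyObjectA (placeholder format).
      public static class EnrichMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String a = line.toString();
          // Placeholder for the merged A -> D processing (our three processors).
          String d = a + "|enriched";
          // Key by record id so each MyObjectD reaches a reducer exactly once.
          String id = a.split("\\|", 2)[0];
          context.write(new Text(id), new Text(d));
        }
      }

      // The reducer would only persist the results.
      public static class WriteReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text id, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          for (Text d : values) {
            // Write d to Accumulo here (e.g. with a BatchWriter), or switch the
            // job to Accumulo's AccumuloOutputFormat and emit Mutations instead.
            context.write(id, d);
          }
        }
      }
    }

Whether the Accumulo write belongs in the reducer at all, or whether a
map-only job would do, is part of what I am asking.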

Overall, I want to know how one goes about (more or less blindly) taking
an existing .jar (a Java application) and submitting it as a MapReduce job.
We are going this route because multi-threading won't solve our problem: we
have to process objects in batches now, and they are coming in every
minute.
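
For example (again only a sketch, with made-up paths and the hypothetical
EnrichJob classes from above), is it essentially a driver like this plus a
hadoop jar invocation?

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EnrichJobDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "enrich A to D");
        job.setJarByClass(EnrichJobDriver.class);
        job.setMapperClass(EnrichJob.EnrichMapper.class);
        job.setReducerClass(EnrichJob.WriteReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // serialized MyObjectA records
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // job output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

run with something like:

    hadoop jar my-etl-job.jar EnrichJobDriver /input/objectA /output/objectD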

Thank you in advance for any and all help.

Re: Java jars and MapReduce

Posted by Aji Janis <aj...@gmail.com>.
Steve,

Thank you for the response. Please see below.

1) The Java processing is the rate-limiting step. Are you saying Accumulo
isn't designed to work with MapReduce? Can you elaborate, please?

I absolutely agree that my implementation should be a chain of MapReduce
jobs. Unfortunately, at this time there are reasons to roll all the
processing steps into ONE jar; going forward I expect that will not be the
case. That said, I think one of the more interesting challenges will be
Solr-indexing MyObjectD as it becomes ready. Do you have any insights into
how well Solr indexing works with MapReduce? Issues? Workarounds?
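
What I picture (purely a sketch, assuming SolrJ and made-up URL and field
names) is indexing each document wherever D is finalized:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder core URL and fields; the real MyObjectD fields would go here.
        HttpSolrServer solr = new HttpSolrServer("http://solrhost:8983/solr/myCore");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "objectD-123");
        doc.addField("payload", "enriched contents of MyObjectD");
        solr.add(doc);   // is this safe to call from many mappers/reducers at once?
        solr.commit();   // or better left to Solr's autoCommit?
        solr.shutdown();
      }
    }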

Another interesting question: I would like to use Log4j for periodic (and
quite heavy) logging from inside the jobs. Can I do this? If so, where do
those log files end up being written?
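
Concretely (placeholder names again), I mean something like this inside
each mapper:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.log4j.Logger;

    public class LoggingMapper extends Mapper<LongWritable, Text, Text, Text> {
      private static final Logger LOG = Logger.getLogger(LoggingMapper.class);

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        LOG.info("processing record at offset " + offset.get());  // frequent calls like this
        // ... the A -> D processing and context.write(...) would go here ...
      }
    }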

Thank you again for your feedback.

On Fri, Mar 1, 2013 at 11:21 PM, Steve Lewis <lo...@gmail.com> wrote:

> A few basic questions:
> 1) Is the rate-limiting step the Java processing or the storage in Accumulo?
> Hadoop may not be able to speed up a database that is not designed to work
> in a distributed manner.
>
> 2) Can MyObjectD or any intermediate objects be serialized, possibly to XML,
> and efficiently deserialized? If so, they can be passed around outside the
> database, and you might consider passing the serialized form through several
> jobs run in series.
>
> 3) Alternatively, if the object can be identified by a single id in the
> database, the intermediate results live in the database, and the processing
> (not the database) is the rate-limiting step, then every step could be a
> separate job with a custom InputFormat passing ids to the mappers.
>
> You need to spend a lot of time thinking about which steps in the
> processing are rate limiting and how much of the performance bottleneck is
> in the database.
>
> Steven M. Lewis PhD
>  On Mar 1, 2013 9:48 AM, "Aji Janis" <aj...@gmail.com> wrote:
>
>> Hello,
>>
>> Current design: I have a Java object, MyObjectA. MyObjectA goes through
>> three processors (jars) that run in sequence and do a lot of processing to
>> enrich A with a great deal of additional data (think ETL); the final result
>> is MyObjectD (note: MyObjectD is really A with more fields added to it, but
>> I want to be clear that the two end up quite different). When MyObjectD is
>> ready, it is saved to my non-relational database (Accumulo). Currently, all
>> of this is driven by Quartz Scheduler: a List<MyObjectA> is submitted for
>> processing every N minutes. Everything is written in Java, and there is a
>> lot of back-and-forth with Accumulo (to access tables that help convert A
>> to D).
>>
>> We split the processing into three processors only because it was more
>> convenient and we didn't want everything rolled up in one processor. That
>> said, I can definitely merge the three into ONE processor. My question is:
>> generally speaking, what do I need to be concerned about, or look into, to
>> turn this into a MapReduce job? I am asking for pointers on where to even
>> start.
>>
>> Let's say all my processing is done in mappers, so my input is MyObjectA
>> and my output is MyObjectD from each mapper, and then my reducer simply
>> writes all the MyObjectD objects to Accumulo. Is achieving this as easy as
>> just submitting the jar to Hadoop?
>>
>> Overall, I want to know how one goes about (more or less blindly) taking
>> an existing .jar (a Java application) and submitting it as a MapReduce job.
>> We are going this route because multi-threading won't solve our problem: we
>> have to process objects in batches now, and they are coming in every
>> minute.
>>
>> Thank you in advance for any and all help.
>>
>>
>>

Re: Java jars and MapReduce

Posted by Steve Lewis <lo...@gmail.com>.
A few basic questions:
1) Is the rate-limiting step the Java processing or the storage in Accumulo?
Hadoop may not be able to speed up a database that is not designed to work
in a distributed manner.

2) Can MyObjectD or any intermediate objects be serialized, possibly to XML,
and efficiently deserialized? If so, they can be passed around outside the
database, and you might consider passing the serialized form through several
jobs run in series.
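
XML is one option; another common one in Hadoop is to make the object a
Writable so it can travel between jobs in SequenceFiles. Roughly, with
made-up fields standing in for the real ones:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class MyObjectD implements Writable {
      // Made-up fields; the real object would serialize whatever it carries.
      private String id;
      private String enrichedPayload;

      public MyObjectD() {}  // Hadoop needs a no-arg constructor for Writables

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(enrichedPayload);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        id = in.readUTF();
        enrichedPayload = in.readUTF();
      }
    }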

3) Alternatively, if the object can be identified by a single id in the
database, the intermediate results live in the database, and the processing
(not the database) is the rate-limiting step, then every step could be a
separate job with a custom InputFormat passing ids to the mappers.
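
To make that concrete (names below are made up; one quick way to prototype
the idea without writing a full custom InputFormat is to feed a plain text
file of ids through NLineInputFormat, one id per line):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One processing step per job; the mapper receives ids, not whole objects.
    public class IdDrivenStepMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
      @Override
      protected void map(LongWritable offset, Text idLine, Context context)
          throws IOException, InterruptedException {
        String id = idLine.toString().trim();
        // 1. Fetch the object's current state from Accumulo by id.
        // 2. Apply this job's processing step.
        // 3. Write the updated object back to Accumulo.
        // (Accumulo client calls are omitted from this sketch.)
      }
    }

Only the ids flow through MapReduce; the heavy data stays in the database,
and each existing jar can remain its own job in the chain.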

You need to spend a lot of time thinking about which steps in the
processing are rate limiting and how much of the performance bottleneck is
in the database.

Steven M. Lewis PhD
 On Mar 1, 2013 9:48 AM, "Aji Janis" <aj...@gmail.com> wrote:

> Hello,
>
> Current design: I have a Java object, MyObjectA. MyObjectA goes through
> three processors (jars) that run in sequence and do a lot of processing to
> enrich A with a great deal of additional data (think ETL); the final result
> is MyObjectD (note: MyObjectD is really A with more fields added to it, but
> I want to be clear that the two end up quite different). When MyObjectD is
> ready, it is saved to my non-relational database (Accumulo). Currently, all
> of this is driven by Quartz Scheduler: a List<MyObjectA> is submitted for
> processing every N minutes. Everything is written in Java, and there is a
> lot of back-and-forth with Accumulo (to access tables that help convert A
> to D).
>
> We split the processing into three processors only because it was more
> convenient and we didn't want everything rolled up in one processor. That
> said, I can definitely merge the three into ONE processor. My question is:
> generally speaking, what do I need to be concerned about, or look into, to
> turn this into a MapReduce job? I am asking for pointers on where to even
> start.
>
> Let's say all my processing is done in mappers, so my input is MyObjectA
> and my output is MyObjectD from each mapper, and then my reducer simply
> writes all the MyObjectD objects to Accumulo. Is achieving this as easy as
> just submitting the jar to Hadoop?
>
> Overall, I want to know how one goes about (more or less blindly) taking
> an existing .jar (a Java application) and submitting it as a MapReduce job.
> We are going this route because multi-threading won't solve our problem: we
> have to process objects in batches now, and they are coming in every
> minute.
>
> Thank you in advance for any and all help.
>
>
>
