Posted to common-user@hadoop.apache.org by Guy Doulberg <Gu...@conduit.com> on 2011/04/07 09:39:59 UTC

Developing, Testing, Distributing

Hey,

I have been developing Map/Red jars for a while now, and I am still not comfortable with the development environment I have put together for myself (and the team).

I am curious how other Hadoop developers out there develop their jobs...

What IDE are you using?
What plugins to the IDE are you using?
How do you test your code? Which unit test libraries are you using? How do you run your automated tests after you have finished development?
Do you have test/QA/staging environments besides dev and production? How do you keep them similar to production?
Code reuse - how do you build components that can be used in other jobs? Do you build generic map or reduce classes?

I can tell you that I have no answers to the questions above,

I hope this post is not too general, but I think the discussion here could be helpful for newbie and experienced developers alike.

Thanks Guy

Re: Developing, Testing, Distributing

Posted by Chris K Wensel <ch...@wensel.net>.
> But when I tried to implement a real-life project, things became too complicated for me; things didn't go the way I expected them to,
> and I had to implement it using the plain map/red API


As with any framework (in Java), it takes time to find the best practices, many of which are documented in the user guide.

That said, a number of projects on top of Cascading simplify development, namely the JRuby and Clojure integrations and query languages.
http://www.cascading.org/modules.html

And don't forget you can still run raw MR jobs in tandem with Cascading flows, so there is no need to rewrite working apps.

Don't hesitate to ask questions on the list or in the IRC channel (#cascading).

chris

--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support for Cascading


Re: Developing, Testing, Distributing

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
Did you try Pig? It drastically reduces boilerplate code for common operations like Join, Group, Cogroup, Filter, Projection, and Order.
Pig also gives you some advanced features, like multi-query optimization, that are ugly to code by hand and difficult to maintain.

Most of the people I know don't write map-reduce code anymore, as it's too low level and simply too much work.

If you are looking for alternatives that solve your problems better, I suggest you give Pig (version 0.8) a shot - http://pig.apache.org

Regards
-@nkur

On 4/7/11 9:48 AM, "Guy Doulberg" <Gu...@conduit.com> wrote:

Thanks for your answers,

I checked out Cascading for a while.
It was easy to get started and do the tutorial, and I really liked the modeling of pipes, cogroups and so on...

But when I tried to implement a real-life project, things became too complicated for me; things didn't go the way I expected them to,
and I had to implement it using the plain map/red API.

I think I should give it another try.





RE: Developing, Testing, Distributing

Posted by Guy Doulberg <Gu...@conduit.com>.
Thanks for your answers,

I checked out Cascading for a while.
It was easy to get started and do the tutorial, and I really liked the modeling of pipes, cogroups and so on...

But when I tried to implement a real-life project, things became too complicated for me; things didn't go the way I expected them to,
and I had to implement it using the plain map/red API.

I think I should give it another try.




Re: Developing, Testing, Distributing

Posted by David Rosenstrauch <da...@darose.net>.
On 04/07/2011 03:39 AM, Guy Doulberg wrote:
> Hey,
>
> I have been developing Map/Red jars for a while now, and I am still not comfortable with the development environment I have put together for myself (and the team)
>
> I am curious how other Hadoop developers out there develop their jobs...
>
> What IDE are you using?

Eclipse

> What plugins to the IDE are you using?

Um .... subclipse.  (And findbugs sometimes.)

> How do you test your code? Which unit test libraries are you using? How do you run your automated tests after you have finished development?

JUnit.  Run the tests right inside Eclipse using the IDE's built-in JUnit capabilities.

> Do you have test/QA/staging environments besides dev and production? How do you keep them similar to production?

We have small dev and QA Hadoop clusters, in addition to the large production cluster.  We don't do anything in particular to keep them similar.  If you want to run a test job and need some data that's on the prod cluster, you have to port it yourself.

> Code reuse - how do you build components that can be used in other jobs? Do you build generic map or reduce classes?

If you do test-driven development when you write your code, you wind up with components that you can test independently and then plug into your M/R classes.
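That factoring can be illustrated with a small sketch (the class name and the log format here are hypothetical, not from any real project): the parsing logic has no Hadoop imports, so it can be unit tested on its own, and a Mapper would just delegate to it.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Pure parsing logic with no Hadoop dependencies -- unit-testable on its own.
// A Mapper's map() method would simply call LogLineParser.extractDomain(value.toString()).
public class LogLineParser {

    // Extracts the host from a tab-separated log line of the hypothetical form
    // "timestamp<TAB>url<TAB>status"; returns null for malformed input.
    public static String extractDomain(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 2) {
            return null;
        }
        try {
            return new URI(fields[1]).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The kind of cases a JUnit test would cover:
        System.out.println(extractDomain("1302160799\thttp://example.com/a\t200")); // example.com
        System.out.println(extractDomain("garbage")); // null
    }
}
```

Because nothing here touches the Hadoop API, the tests run in a plain JVM with no cluster, which is exactly what makes the component reusable across jobs.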



Re: Developing, Testing, Distributing

Posted by Chris K Wensel <ch...@wensel.net>.
> How do you test your code? Which unit test libraries are you using? How do you run your automated tests after you have finished development?
> Do you have test/QA/staging environments besides dev and production? How do you keep them similar to production?
> Code reuse - how do you build components that can be used in other jobs? Do you build generic map or reduce classes?


In all honesty you should take a look at Cascading. It was designed to simplify this, but keep in mind I'm the project lead, so I'm biased.

In Cascading, there are three distinct elements that can be tested independently:

- operations, things like functions and filters, which can typically be re-used in any Cascading app
- assemblies of operations that constitute a unit of work or some algorithmic process (these become one or more MR jobs at runtime)
- taps, the things that talk to HDFS or to external systems like HBase, Couchbase, MySQL, ElasticSearch, etc.

Each of these can be unit tested individually or as a whole, and you can build libraries or frameworks usable by other developers on your teams.

The real value is that you no longer need to think in MapReduce when developing, just in the problem domain.

And you can test your processing app independently of making it work in staging or production just by swapping out taps.

http://www.cascading.org/
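The "swap out taps" idea is essentially a seam between processing logic and I/O. A minimal plain-Java sketch of that seam, with all names hypothetical (this is not Cascading's actual API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for source/sink taps: the processing step sees only
// these interfaces, so tests can use in-memory implementations while
// production wires in HDFS- or HBase-backed ones.
interface Source { List<String> read(); }
interface Sink { void write(String record); }

class UppercaseJob {
    // The unit of work: pure logic, no knowledge of where the data lives.
    static void run(Source source, Sink sink) {
        for (String record : source.read()) {
            sink.write(record.toUpperCase());
        }
    }
}

public class TapDemo {
    public static void main(String[] args) {
        // In a test, the "taps" are just in-memory collections.
        List<String> out = new ArrayList<>();
        UppercaseJob.run(() -> Arrays.asList("alpha", "beta"), out::add);
        System.out.println(out); // [ALPHA, BETA]
    }
}
```

Replacing the lambda and the list with cluster-backed implementations changes where the data lives without touching UppercaseJob, which is the property being described above.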

btw, I use IntelliJ for all my development. 

cheers,
chris

--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support for Cascading


RE: Developing, Testing, Distributing

Posted by Guy Doulberg <Gu...@conduit.com>.
Thanks, I think I will try your way of developing (replacing ant).
________________________________________
From: Tsz Wo (Nicholas), Sze [s29752-hadoopgeneral@yahoo.com]
Sent: Friday, April 08, 2011 21:08
To: common-user@hadoop.apache.org
Subject: Re: Developing, Testing, Distributing


Re: Developing, Testing, Distributing

Posted by "Tsz Wo (Nicholas), Sze" <s2...@yahoo.com>.
First of all, I am a Hadoop contributor and I am familiar with the Hadoop code base and build mechanism.  Here is what I do:


Q1: What IDE are you using?
Eclipse.

Q2: What plugins to the IDE are you using?
No plugins.

Q3: How do you test your code? Which unit test libraries are you using? How do you run your automated tests after you have finished development?
I use JUnit.  The tests are executed using ant, the same way we do it in Hadoop development.

Q4: Do you have test/QA/staging environments besides dev and production? How do you keep them similar to production?
We, at Yahoo!, have test clusters with settings similar to the production cluster.

Q5: Code reuse - how do you build components that can be used in other jobs? Do you build generic map or reduce classes?
I have my own framework for running generic computations and generic jobs.

Some more details:
1) svn checkout MapReduce trunk (or common/branches/branch-0.20 for 0.20)
2) compile everything using ant
3) set up eclipse
4) remove the existing files under ./src/examples
5) develop my code under ./src/examples
6) add unit tests under ./src/test/mapred

I find it very convenient since (i) the build scripts can compile the example code, run the unit tests, create the jar, etc., and (ii) Hadoop contributors maintain them.

Hope it helps.
Nicholas Sze

Re: Developing, Testing, Distributing

Posted by "W.P. McNeill" <bi...@gmail.com>.
I use IntelliJ, though Eclipse works too.  I don't have any Hadoop-specific
plug-ins; both IDEs are just set up as vanilla Java programming
environments.

Chapter 5 of *Hadoop: The Definitive Guide* (http://www.librarything.com/work/8488103) has a good overview of testing methodology. It's what I follow. I always run code in local single-JVM mode so that I can step through it in the IDE's debugger. Only when I've got that working do I try to deploy to a cluster. For debugging scale-up bugs that only happen on the cluster I rely on Log4j logging, though I add code that allows the log level to be set via a Hadoop configuration parameter so that I can run different jobs at different log levels on the same cluster. I also do my best to factor my logic apart from the Hadoop boilerplate so that I can unit test the former with JUnit 4. More details on my testing methodology here:
http://cornercases.wordpress.com/2011/04/08/unit-testing-mapreduce-with-the-adapter-pattern/
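The per-job log level trick can be sketched with the standard library (java.util.logging standing in for Log4j, and a plain string argument standing in for the value read from the Hadoop job configuration; the property name and class are hypothetical):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of setting the log level from a per-job configuration value, e.g.
// conf.get("my.job.log.level") in a real Hadoop job. Each submitted job can
// then pick its own verbosity without changing cluster-wide logging settings.
public class ConfigurableLogLevel {
    static final Logger LOG = Logger.getLogger("my.job");

    static void applyLevel(String configuredLevel) {
        // Fall back to INFO when the job configuration does not set a level.
        Level level = (configuredLevel == null) ? Level.INFO : Level.parse(configuredLevel);
        LOG.setLevel(level);
    }

    public static void main(String[] args) {
        applyLevel("FINE");             // one job runs verbose...
        System.out.println(LOG.getLevel());
        applyLevel(null);               // ...another at the default level
        System.out.println(LOG.getLevel());
    }
}
```

In Log4j the equivalent call would be Logger.setLevel with Level.toLevel, but the structure, reading the level from the job's own configuration at setup time, is the same idea.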

In my organization we have a single cluster that is used for research, testing, and production work. This works fine.  Just make sure to set up HDFS permissions so that you don't accidentally delete work. We also have a separate one-node cluster that is used for controlled measurement of the wall-clock performance of Hadoop jobs.

We write directly to the Map/Reduce interface instead of using higher-level tools like Pig or Cascading. Those higher-level tools look like they would be helpful, but no one has had the time to learn them yet. All the code reuse techniques I employ are described in the link above. For each job I end up directly subclassing Hadoop's Mapper and Reducer classes. I find those to already be at the right level of generality and haven't had cause to add any further encapsulation.

