Posted to common-dev@hadoop.apache.org by Erik Paulson <ep...@unit1127.com> on 2013/01/16 00:50:17 UTC

development environment for hadoop core

Hello -

I'm curious what Hadoop developers use for their day-to-day hacking on
Hadoop. I'm talking changes to the Hadoop libraries and daemons, and not
developing Map-Reduce jobs or using the HDFS Client libraries to talk
to a filesystem from an application.

I've checked out Hadoop, made minor changes and built it with Maven, and
tracked down the resulting artifacts in a target/ directory that I could
deploy. Is this typically how a cloudera/hortonworks/mapr/etc dev works, or
are the IDEs more common?

I realize this sort of sounds like a dumb question, but I'm mostly curious
what I might be missing out on if I stay away from anything other than vim,
and not being entirely sure where maven might be caching jars that it uses
to build, and how careful I have to be to ensure that my changes wind up in
the right places without having to do a clean build every time.

Thanks!

-Erik

Re: Hive utf8

Posted by Nitin Pawar <ni...@gmail.com>.
forgot to add JIRA

https://issues.apache.org/jira/browse/HIVE-1505


On Wed, Jan 16, 2013 at 1:25 PM, Nitin Pawar <ni...@gmail.com> wrote:

> you may be running into this
>
>
>
>
> On Wed, Jan 16, 2013 at 7:24 AM, springring <sp...@126.com> wrote:
>
>> Hi,
>>
>>    I put some files that include Chinese text into HDFS.
>>    Reading them with "hadoop fs -cat /user/hive/warehouse/..." works;
>> I can see the Chinese.
>>
>>   But when I open the table in Hive, I can't read the Chinese (English
>> is OK). Why?
>>
>
>
>
> --
> Nitin Pawar
>



-- 
Nitin Pawar

Re: Hive utf8

Posted by Nitin Pawar <ni...@gmail.com>.
you may be running into this




On Wed, Jan 16, 2013 at 7:24 AM, springring <sp...@126.com> wrote:

> Hi,
>
>    I put some files that include Chinese text into HDFS.
>    Reading them with "hadoop fs -cat /user/hive/warehouse/..." works;
> I can see the Chinese.
>
>   But when I open the table in Hive, I can't read the Chinese (English
> is OK). Why?
>



-- 
Nitin Pawar

Hive utf8

Posted by springring <sp...@126.com>.
Hi,

   I put some files that include Chinese text into HDFS.
   Reading them with "hadoop fs -cat /user/hive/warehouse/..." works; I can see the Chinese.

  But when I open the table in Hive, I can't read the Chinese (English is OK). Why?

Re: development environment for hadoop core

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Erik,

When I started out on Hadoop development, I used to use emacs for most of
my development. I eventually "saw the light" and switched to eclipse with a
bunch of emacs keybindings - using an IDE is really handy in Java for
functions like "find callers of", quick navigation to types, etc. etags
gets you part of the way, but I'm pretty sold on eclipse at this point. The
other big advantage I found of Eclipse is that the turnaround time on
running tests is near-instant - make a change, hit save, and run a unit
test in a second or two, instead of waiting 20+sec for maven (even on a
non-clean build).
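
For reference, the single-test run from maven looks roughly like the below
(the test class name here is just a placeholder) - most of that 20+sec is
maven and surefire spinning up before the test even starts:

$ cd hadoop-hdfs-project/hadoop-hdfs
$ mvn test -Dtest=TestSomething    # substitute the test class you care about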

That said, for quick fixes or remote debugging work I fall back to vim
pretty quickly.

-Todd

On Tue, Jan 15, 2013 at 3:50 PM, Erik Paulson <ep...@unit1127.com> wrote:

> Hello -
>
> I'm curious what Hadoop developers use for their day-to-day hacking on
> Hadoop. I'm talking changes to the Hadoop libraries and daemons, and not
> developing Map-Reduce jobs or using the HDFS Client libraries to talk
> to a filesystem from an application.
>
> I've checked out Hadoop, made minor changes and built it with Maven, and
> tracked down the resulting artifacts in a target/ directory that I could
> deploy. Is this typically how a cloudera/hortonworks/mapr/etc dev works, or
> are the IDEs more common?
>
> I realize this sort of sounds like a dumb question, but I'm mostly curious
> what I might be missing out on if I stay away from anything other than vim,
> and not being entirely sure where maven might be caching jars that it uses
> to build, and how careful I have to be to ensure that my changes wind up in
> the right places without having to do a clean build every time.
>
> Thanks!
>
> -Erik
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: development environment for hadoop core

Posted by Steve Loughran <st...@hortonworks.com>.
My setup (I work from home):

# OS/X laptop w/ 30" monitor
# FTTC broadband, 55Mbit/s down, 15+ up -it's the upload bandwidth that
really helps development: http://www.flickr.com/photos/steve_l/8050751551/
# IntelliJ IDEA IDE, settings edited for a 2GB Heap
# Maven on the command line for builds
# I run a "mvn install -DskipTests" every morning to ensure that apache's
own -SNAPSHOT artifacts aren't pulled in.
# CentOS 6.3 VM for doing the full binary build & test, making my own RPMs,
etc.
# coffee.
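
The morning routine is basically the following; the checkout path is just an
example, and the -o (offline) flag is optional but keeps maven honest about
using what's already in the local repo:

$ cd ~/work/hadoop-trunk        # wherever your checkout lives
$ mvn install -DskipTests       # refresh the -SNAPSHOT artifacts in ~/.m2
$ mvn -o package -DskipTests    # later builds can then run offline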

One thing that annoys me is that I've got an airplay-driven hifi set up,
and during builds there's enough CPU/RAM load that the music has dropouts.
Whoever thought of streaming over UDP without an option for deeper
buffering clearly doesn't use maven.

What I am doing is moving my CentOS VM off the laptop and into the Rackspace
cloud. That saves RAM for the IDE, and as I'm testing things in the same
infrastructure, it gives me the ability to deploy artifacts at gigabit
rates. I just use git as a way of syncing source.

One thing I am debating -again on rackspace- is to set up Jenkins on
yet-another-VM, polling aggressively, and automatically running the full
test suite every half hour. That way, it does the full regression testing
on all changes on my branch, while I focus on the one or two tests that I
care about.


That's something I discussed a while back:
https://docs.google.com/document/d/16v4SFYC6WSB-Y-B0Uo3IEhEhEN9pPYrtvhtR8KW9ouI/edit

It's only now that I'm sitting down and really doing it. Git & GitHub
make a difference, as I can have my own personal branches for the CI
tooling to play with.

Has anyone else tried anything like this?



On 16 January 2013 00:50, Erik Paulson <ep...@unit1127.com> wrote:

> Hello -
>
> I'm curious what Hadoop developers use for their day-to-day hacking on
> Hadoop. I'm talking changes to the Hadoop libraries and daemons, and not
> developing Map-Reduce jobs or using the HDFS Client libraries to talk
> to a filesystem from an application.
>
> I've checked out Hadoop, made minor changes and built it with Maven, and
> tracked down the resulting artifacts in a target/ directory that I could
> deploy. Is this typically how a cloudera/hortonworks/mapr/etc dev works, or
> are the IDEs more common?
>
> I realize this sort of sounds like a dumb question, but I'm mostly curious
> what I might be missing out on if I stay away from anything other than vim,
> and not being entirely sure where maven might be caching jars that it uses
> to build, and how careful I have to be to ensure that my changes wind up in
> the right places without having to do a clean build every time.
>
> Thanks!
>
> -Erik
>

Re: development environment for hadoop core

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
Hi Erik,

Eclipse can run junit tests very rapidly.  If you want a shorter test
cycle, that's one way to get it.

There is also Maven-shell, which reduces some of the overhead of starting
Maven.  But I haven't used it so I can't really comment.

cheers,
Colin


On Mon, Jan 21, 2013 at 8:36 AM, Erik Paulson <ep...@unit1127.com> wrote:

> On Wed, Jan 16, 2013 at 7:31 AM, Glen Mazza <gm...@talend.com> wrote:
>
> > On 01/15/2013 06:50 PM, Erik Paulson wrote:
> >
> >> Hello -
> >>
> >> I'm curious what Hadoop developers use for their day-to-day hacking on
> >> Hadoop. I'm talking changes to the Hadoop libraries and daemons, and not
> >> developing Map-Reduce jobs or using the HDFS Client libraries to
> >> talk
> >> to a filesystem from an application.
> >>
> >> I've checked out Hadoop, made minor changes and built it with Maven, and
> >> tracked down the resulting artifacts in a target/ directory that I could
> >> deploy. Is this typically how a cloudera/hortonworks/mapr/etc dev works,
> >> or
> >> are the IDEs more common?
> >>
> > I haven't built Hadoop yet myself.  Your use of "a" in "a target/
> > directory" indicates you're also kind of new with Maven itself, as that's
> > the standard output folder for any Maven project.  One of many nice
> things
> > about Maven is once you learn how to build one project with it you pretty
> > much know how to build any project with it, as everything's standardized
> > with it.
> >
> > Probably best to stick with the command line for building and use Eclipse
> > for editing, to keep things simple, but don't forget the mvn
> > eclipse:eclipse command to set up Eclipse projects that you can
> > subsequently import into your Eclipse IDE:
> > http://www.jroller.com/gmazza/entry/web_service_tutorial#EclipseSetup
> >
> >
> >
> >> I realize this sort of sounds like a dumb question, but I'm mostly
> curious
> >> what I might be missing out on if I stay away from anything other than
> >> vim,
> >> and not being entirely sure where maven might be caching jars that it
> uses
> >> to build,
> >>
> >
> > That will be your local Maven repository, in an .m2 hidden folder in your
> > user home directory.
> >
> >
> >
> >  and how careful I have to be to ensure that my changes wind up in
> >> the right places without having to do a clean build every time.
> >>
> >>
> > Maven can detect changes (using mvn install instead of mvn clean
> install),
> > but I prefer doing clean builds.  You can use the -Dmaven.test.skip
> setting
> > to speed up your "mvn clean installs" if you don't wish to run the tests
> > each time.
> >
>
> Thanks to everyone for their advice last week, it's been helpful.
>
> You're spot-on that I'm new to Maven, but I'm a little confused as to what
> the different targets/goals are best to use. Here's my scenario.
>
> What I'd like to get working is the DataNodeCluster, which lives in the
> tests.
>
> Running it from hadoop-hdfs-project/hadoop-hdfs/target as
> 'hadoop jar ./hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar
> org.apache.hadoop.hdfs.DataNodeCluster
> -n 2'
>
> blows up with a NPE inside of MiniDFSCluster - the offending line is
> 'dfsdir = conf.get(HDFS_MINIDFS_BASEDIR, null);' (line 2078 of
> MiniDFSCluster.java)
>
> I'm not worried about being able to figure out what's wrong (I'm pretty
> sure it's that conf is still null when this gets called) - what I'm trying
> to use this as is a way to understand what gets built when.
>
> Just to check, I added a System.out.println one line before 2078 of
> MiniDFSCluster, and recompiled from hadoop-common/hadoop-hdfs-project with
>
> mvn package -DskipTests
>
> Because I don't want to run all the tests.
>
> This certainly compiles the code - if I leave the semicolon off of my
> change the compile fails, even with -DskipTests. However, it doesn't appear
> to rebuild
>
> target/hadoop-hdfs-3.0.0-SNAPSHOT/share/hadoop/hdfs/hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar
> - the timestamp is still the old version.
>
> It _does_ copy
>
> target/hadoop-hdfs-3.0.0-SNAPSHOT/share/hadoop/hdfs/hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar
> to target/hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar, or at least otherwise
> update the timestamp on target/hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar (unless
> it's copying or building it from somewhere else - but if it is, it's
> picking up old versions of my code)
>
> I only get an updated version if I ask for
> mvn package -Pdist -DskipTests
>
> Which is a 3 minute rebuild cycle, even for something as simple as changing
> the text in my System.out.println. (Even just a mvn package -DskipTests
> with no changes to any source code is a 40 second operation)
>
> I haven't sat around and waited for 'mvn package' to run and fire off the
> test suite. I don't know if that would result in an updated
> hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar
> being built.
>
> So, my questions are:
>
> - Is there a better maven target to use if I just want to update code in
> MiniDFSCluster.java and run DataNodeCluster, all of which wind up in
> -tests.jar? ('better' here means a shorter build cycle. I'm a terrible
> programmer so finding errors quickly is a priority for me :)
> - is it worth being concerned that 'mvn package' on what should be a no-op
> takes as long as it does?
>
> I'll sort out the NPE in DataNodeCluster and file appropriate JIRAs. (This
> is all on the trunk - git show-ref is
> 2fc22342f44055ae4a2b526408de7524bf1f9215 HEAD, so the trunk as of last
> Wednesday)
>
> Thanks!
>
> -Erik
>

Re: development environment for hadoop core

Posted by Erik Paulson <ep...@unit1127.com>.
On Wed, Jan 16, 2013 at 7:31 AM, Glen Mazza <gm...@talend.com> wrote:

> On 01/15/2013 06:50 PM, Erik Paulson wrote:
>
>> Hello -
>>
>> I'm curious what Hadoop developers use for their day-to-day hacking on
>> Hadoop. I'm talking changes to the Hadoop libraries and daemons, and not
>> developing Map-Reduce jobs or using the HDFS Client libraries to
>> talk
>> to a filesystem from an application.
>>
>> I've checked out Hadoop, made minor changes and built it with Maven, and
>> tracked down the resulting artifacts in a target/ directory that I could
>> deploy. Is this typically how a cloudera/hortonworks/mapr/etc dev works,
>> or
>> are the IDEs more common?
>>
> I haven't built Hadoop yet myself.  Your use of "a" in "a target/
> directory" indicates you're also kind of new with Maven itself, as that's
> the standard output folder for any Maven project.  One of many nice things
> about Maven is once you learn how to build one project with it you pretty
> much know how to build any project with it, as everything's standardized
> with it.
>
> Probably best to stick with the command line for building and use Eclipse
> for editing, to keep things simple, but don't forget the mvn
> eclipse:eclipse command to set up Eclipse projects that you can
> subsequently import into your Eclipse IDE:
> http://www.jroller.com/gmazza/entry/web_service_tutorial#EclipseSetup
>
>
>
>> I realize this sort of sounds like a dumb question, but I'm mostly curious
>> what I might be missing out on if I stay away from anything other than
>> vim,
>> and not being entirely sure where maven might be caching jars that it uses
>> to build,
>>
>
> That will be your local Maven repository, in an .m2 hidden folder in your
> user home directory.
>
>
>
>  and how careful I have to be to ensure that my changes wind up in
>> the right places without having to do a clean build every time.
>>
>>
> Maven can detect changes (using mvn install instead of mvn clean install),
> but I prefer doing clean builds.  You can use the -Dmaven.test.skip setting
> to speed up your "mvn clean installs" if you don't wish to run the tests
> each time.
>

Thanks to everyone for their advice last week, it's been helpful.

You're spot-on that I'm new to Maven, but I'm a little confused as to what
the different targets/goals are best to use. Here's my scenario.

What I'd like to get working is the DataNodeCluster, which lives in the
tests.

Running it from hadoop-hdfs-project/hadoop-hdfs/target as
'hadoop jar ./hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar
org.apache.hadoop.hdfs.DataNodeCluster
-n 2'

blows up with a NPE inside of MiniDFSCluster - the offending line is
'dfsdir = conf.get(HDFS_MINIDFS_BASEDIR, null);' (line 2078 of
MiniDFSCluster.java)

I'm not worried about being able to figure out what's wrong (I'm pretty
sure it's that conf is still null when this gets called) - what I'm trying
to use this as is a way to understand what gets built when.

Just to check, I added a System.out.println one line before 2078 of
MiniDFSCluster, and recompiled from hadoop-common/hadoop-hdfs-project with

mvn package -DskipTests

Because I don't want to run all the tests.

This certainly compiles the code - if I leave the semicolon off of my
change the compile fails, even with -DskipTests. However, it doesn't appear
to rebuild
target/hadoop-hdfs-3.0.0-SNAPSHOT/share/hadoop/hdfs/hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar
- the timestamp is still the old version.

It _does_ copy
target/hadoop-hdfs-3.0.0-SNAPSHOT/share/hadoop/hdfs/hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar
to target/hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar, or at least otherwise
update the timestamp on target/hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar (unless
it's copying or building it from somewhere else - but if it is, it's
picking up old versions of my code)

I only get an updated version if I ask for
mvn package -Pdist -DskipTests

Which is a 3 minute rebuild cycle, even for something as simple as changing
the text in my System.out.println. (Even just a mvn package -DskipTests
with no changes to any source code is a 40 second operation)

I haven't sat around and waited for 'mvn package' to run and fire off the
test suite. I don't know if that would result in an updated
hadoop-hdfs-3.0.0-SNAPSHOT-tests.jar
being built.

So, my questions are:

- Is there a better maven target to use if I just want to update code in
MiniDFSCluster.java and run DataNodeCluster, all of which wind up in
-tests.jar? ('better' here means a shorter build cycle. I'm a terrible
programmer so finding errors quickly is a priority for me :)
- is it worth being concerned that 'mvn package' on what should be a no-op
takes as long as it does?

I'll sort out the NPE in DataNodeCluster and file appropriate JIRAs. (This
is all on the trunk - git show-ref is
2fc22342f44055ae4a2b526408de7524bf1f9215 HEAD, so the trunk as of last
Wednesday)

Thanks!

-Erik

Re: development environment for hadoop core

Posted by Glen Mazza <gm...@talend.com>.
On 01/15/2013 06:50 PM, Erik Paulson wrote:
> Hello -
>
> I'm curious what Hadoop developers use for their day-to-day hacking on
> Hadoop. I'm talking changes to the Hadoop libraries and daemons, and not
> developing Map-Reduce jobs or using the HDFS Client libraries to talk
> to a filesystem from an application.
>
> I've checked out Hadoop, made minor changes and built it with Maven, and
> tracked down the resulting artifacts in a target/ directory that I could
> deploy. Is this typically how a cloudera/hortonworks/mapr/etc dev works, or
> are the IDEs more common?
I haven't built Hadoop yet myself.  Your use of "a" in "a target/ 
directory" indicates you're also kind of new with Maven itself, as 
that's the standard output folder for any Maven project.  One of many 
nice things about Maven is once you learn how to build one project with 
it you pretty much know how to build any project with it, as 
everything's standardized with it.

Probably best to stick with the command line for building and use 
Eclipse for editing, to keep things simple, but don't forget the mvn 
eclipse:eclipse command to set up Eclipse projects that you can 
subsequently import into your Eclipse IDE: 
http://www.jroller.com/gmazza/entry/web_service_tutorial#EclipseSetup

>
> I realize this sort of sounds like a dumb question, but I'm mostly curious
> what I might be missing out on if I stay away from anything other than vim,
> and not being entirely sure where maven might be caching jars that it uses
> to build,

That will be your local Maven repository, in an .m2 hidden folder in 
your user home directory.


> and how careful I have to be to ensure that my changes wind up in
> the right places without having to do a clean build every time.
>

Maven can detect changes (using mvn install instead of mvn clean 
install), but I prefer doing clean builds.  You can use the 
-Dmaven.test.skip setting to speed up your "mvn clean installs" if you 
don't wish to run the tests each time.
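
So a typical cycle might look something like this (assuming, say, a checkout 
at ~/src/hadoop - adjust the path to taste):

$ cd ~/src/hadoop                             # example checkout location
$ mvn clean install -Dmaven.test.skip=true    # full build, no tests
$ mvn eclipse:eclipse                         # generate Eclipse project files to import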

HTH,
Glen


> Thanks!
>
> -Erik
>


-- 
Glen Mazza
Talend Community Coders - coders.talend.com
blog: www.jroller.com/gmazza


Re: development environment for hadoop core

Posted by Surenkumar Nihalani <su...@me.com>.
I use Eclipse. I haven't figured out how to run and use mvn from it, so I just use it as an editor. I have a git repo in commons/src, with a branch for each JIRA. I rebase those branches to keep pulling in svn updates.
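
Roughly (the branch name is just an example, and the exact update command
depends on whether you track svn via git-svn or a git mirror):

$ git checkout -b HADOOP-1234 trunk      # one branch per JIRA (example name)
  ... hack, commit ...
$ git checkout trunk && git svn rebase   # pull in the latest svn changes
$ git rebase trunk HADOOP-1234           # replay the JIRA branch on top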


On Jan 15, 2013, at 9:08 PM, Andy Isaacson <ad...@cloudera.com> wrote:

> On Tue, Jan 15, 2013 at 3:50 PM, Erik Paulson <ep...@unit1127.com> wrote:
>> I'm curious what Hadoop developers use for their day-to-day hacking on
>> Hadoop. I'm talking changes to the Hadoop libraries and daemons, and not
>> developing Map-Reduce jobs or using the HDFS Client libraries to talk
>> to a filesystem from an application.
>> 
>> I've checked out Hadoop, made minor changes and built it with Maven, and
>> tracked down the resulting artifacts in a target/ directory that I could
>> deploy. Is this typically how a cloudera/hortonworks/mapr/etc dev works, or
>> are the IDEs more common?
> 
> I use both vim and Eclipse (3.8.0~rc4-1 from Debian). I use git for
> version control with a branch per JIRA. Most testing is done with
> jUnit tests, I try to write a testcase to repro a bug before trying to
> fix the bug. Sometimes for a particular bug I need to install
> artifacts on a cluster (of VMs or physical machines) during the
> edit-compile-debug cycle; in such cases I build with mvn and carefully
> choose which artifacts need to be updated on the target cluster using
> rsync to speed up the cycle.
> 
> It's pretty difficult to develop in Java without using Eclipse or
> similar. Like Todd I stuck to my preferred editor environment for
> several months but found the IDE crutch too useful to avoid entirely.
> Luckily nowadays Eclipse and vim synchronize through the filesystem
> pretty well (much better than 6-8 years ago); I haven't yet lost even
> a single line of code due to "oh you edited the same file in two
> editors and they overwrote one another"; both vim and Eclipse
> carefully say "It was changed on disk! Oh Noes! What shall we do?".
> 
> You can run jUnit tests from either Eclipse or mvn, and I do both regularly.
> 
> -andy


Re: development environment for hadoop core

Posted by Andy Isaacson <ad...@cloudera.com>.
On Tue, Jan 15, 2013 at 3:50 PM, Erik Paulson <ep...@unit1127.com> wrote:
> I'm curious what Hadoop developers use for their day-to-day hacking on
> Hadoop. I'm talking changes to the Hadoop libraries and daemons, and not
> developing Map-Reduce jobs or using the HDFS Client libraries to talk
> to a filesystem from an application.
>
> I've checked out Hadoop, made minor changes and built it with Maven, and
> tracked down the resulting artifacts in a target/ directory that I could
> deploy. Is this typically how a cloudera/hortonworks/mapr/etc dev works, or
> are the IDEs more common?

I use both vim and Eclipse (3.8.0~rc4-1 from Debian). I use git for
version control with a branch per JIRA. Most testing is done with
jUnit tests, I try to write a testcase to repro a bug before trying to
fix the bug. Sometimes for a particular bug I need to install
artifacts on a cluster (of VMs or physical machines) during the
edit-compile-debug cycle; in such cases I build with mvn and carefully
choose which artifacts need to be updated on the target cluster using
rsync to speed up the cycle.
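
The rsync step is nothing fancy - something like the below, where the
hostname and install path are made up and -pl restricts the build to the
module I actually changed:

$ mvn package -DskipTests -pl hadoop-hdfs-project/hadoop-hdfs
$ rsync -av hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-3.0.0-SNAPSHOT.jar \
    testnode1:/opt/hadoop/share/hadoop/hdfs/    # host and path are examples only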

It's pretty difficult to develop in Java without using Eclipse or
similar. Like Todd I stuck to my preferred editor environment for
several months but found the IDE crutch too useful to avoid entirely.
Luckily nowadays Eclipse and vim synchronize through the filesystem
pretty well (much better than 6-8 years ago); I haven't yet lost even
a single line of code due to "oh you edited the same file in two
editors and they overwrote one another"; both vim and Eclipse
carefully say "It was changed on disk! Oh Noes! What shall we do?".

You can run jUnit tests from either Eclipse or mvn, and I do both regularly.

-andy

Re: development environment for hadoop core

Posted by Hitesh Shah <hi...@hortonworks.com>.
On Jan 16, 2013, at 6:17 AM, Gopal Vijayaraghavan wrote:

> So, this is a question I have for everyone else.
> 
> How do I change the hadoop version of an entire build, so that I can
> name it something unique & use it in other builds in maven (-SNAPSHOT
> doesn't cut it, since occasionally mvn will download the hadoop snap
> poms from the remote repos).
> 

The following should work (from http://wiki.apache.org/hadoop/HowToReleasePostMavenization):

$ export version=3.0.0-TEST1
$ mvn versions:set -DnewVersion=${version}
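
After that, something like "mvn install -DskipTests" should put the renamed
artifacts into your local ~/.m2 repository, so downstream builds can depend
on 3.0.0-TEST1 without ever touching the remote snapshot repos:

$ mvn install -DskipTests    # installs the re-versioned artifacts locally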

-- Hitesh


Re: development environment for hadoop core

Posted by Gopal Vijayaraghavan <go...@hortonworks.com>.
Not quite an advanced developer, but I've learnt some shortcuts for my dev
cycle along the way.

> I've checked out Hadoop, made minor changes and built it with Maven, and
> tracked down the resulting artifacts in a target/ directory that I could
> deploy. Is this typically how a cloudera/hortonworks/mapr/etc dev works, or
> are the IDEs more common?

I mostly stuck to vim for my editor, with a few exceptions (Eclipse is
great for browsing from class to class), & mvn eclipse:eclipse works great.

I end up doing mvn package -Pdist

That gives you a hadoop-dist/target/hadoop-${version} to work from.

From then on, the mini-cluster is your friend.

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CLIMiniCluster.html

All I usually specify is -rmport 8032.
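
Per that page, starting it is roughly the following (the exact tests jar
name varies with the version you built):

$ bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    minicluster -rmport 8032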

The next thing I learnt was that for most dev work, file:/// works
great instead of HDFS.

for instance in hive, I could just give

-hiveconf fs.default.name=file://$(FS)/
-hiveconf hive.metastore.warehouse.dir=file://$(FS)/warehouse

(of course, substituting FS for something useful like /tmp/hive/)

and run my queries without worrying about HDFS overheads.
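
Concretely, with /tmp/hive as the scratch directory, that's something like:

$ FS=/tmp/hive
$ hive -hiveconf fs.default.name=file://${FS}/ \
       -hiveconf hive.metastore.warehouse.dir=file://${FS}/warehouse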

Using file:/// URLs for map input and output occasionally simplifies
your debugging a lot.

So basically, you could run

./bin/hadoop jar
share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount
file:///usr/share/dict/words file:///tmp/run1

Or you could just use localhost:9000 in the minicluster if you really
want to test out the HDFS client ops.

Figuring out how to run Hadoop in non-cluster mode has been the most
productivity-inducing thing I've learnt.

Hope that helps.

> I realize this sort of sounds like a dumb question, but I'm mostly curious
> what I might be missing out on if I stay away from anything other than vim,
> and not being entirely sure where maven might be caching jars that it uses
> to build, and how careful I have to be to ensure that my changes wind up in
> the right places without having to do a clean build every time.

find ~/.m2/ helps a bit, but occasionally when I do break the API of
something basic like Writable, I want to use my version of the hadoop
libs for that project.

So, this is a question I have for everyone else.

How do I change the hadoop version of an entire build, so that I can
name it something unique & use it in other builds in maven (-SNAPSHOT
doesn't cut it, since occasionally mvn will download the hadoop snap
poms from the remote repos).

Cheers,
Gopal