Posted to dev@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/12/13 17:41:19 UTC

Upgrading to Hadoop 0.22.0+

Hi,

To keep up with the rest of the world I believe we should move from the old
Hadoop mapred API to the new MapReduce API, as has already been done for
the nutchgora branch. Switching the dependency from hadoop-core to
hadoop-common is easily done in Ivy, but every job must be ported and we have many jobs!
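As a rough illustration of what each port involves, a minimal identity job written against the new org.apache.hadoop.mapreduce API might look like the sketch below. This is not code from the Nutch tree; the class and job names are made up, and it targets the Job constructor available in the 0.20/0.21-era API.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdentityJob {

  // New-API mapper: extend the abstract Mapper class instead of
  // implementing org.apache.hadoop.mapred.Mapper, and emit through
  // Context instead of OutputCollector/Reporter.
  public static class IdentityMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "identity");   // replaces new JobConf(...)
    job.setJarByClass(IdentityJob.class);
    job.setMapperClass(IdentityMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // replaces JobClient.runJob(conf)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mechanical pattern is the same for every job: JobConf becomes Job, interfaces become abstract classes, and OutputCollector/Reporter collapse into Context.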

Anyone willing to give pointers and a helping hand with this large task?

Cheers,

-- 
Markus Jelsma - CTO - Openindex

Re: Upgrading to Hadoop 0.22.0+

Posted by Markus Jelsma <ma...@openindex.io>.
> On 13/12/2011 18:04, Markus Jelsma wrote:
> > Hi
> > 
> > I did a quick test to see what happens and it won't compile. It cannot
> > find our old mapred API's in 0.22. I've also tried 0.20.205.0 which
> > compiles but won't run and many tests fail with stuff like.
> > 
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/codehaus/jackson/map/JsonMappingException
> > 
> >          at org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)
> 
> Hmm... what's that? I don't see this class (or this package) in the
> Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.

It's thrown when the job is run; it must be a mapred thing.

> 
> >          at org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
> >          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >          at org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
> > Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
> > 
> >          at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> >          at java.security.AccessController.doPrivileged(Native Method)
> >          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> >          at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> >          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> >          at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> >          ... 4 more
> > 
> > I think this can be overcome but we cannot hide from the fact that all
> > jobs must be ported to the new API at some point.
> > 
> > You did some work on the new API's, did you come across any cumbersome
> > issues when working on it?
> 
> It was quite some time ago .. but I don't remember anything being really
> complicated, it was just tedious - and once you've done one class the
> other classes follow roughly the same pattern.

Hmm, yes. I checked both Hadoop books and saw a few migration slides. It 
shouldn't be too hard. I'll just give it a try on some custom jobs.

thanks

Re: Upgrading to Hadoop 0.22.0+

Posted by Markus Jelsma <ma...@openindex.io>.
0.21 is here. Can we use this repository for the Hadoop 0.21 POMs? If so, how do we 
change the Ivy configuration to do so? I can't seem to tell Ivy to fetch Hadoop from here:

https://repository.apache.org/content/groups/snapshots/org/apache/hadoop/hadoop-common/
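One way to point Ivy at that repository is to add a Maven-compatible resolver in ivysettings.xml. The sketch below is an assumption, not Nutch's actual settings file: the resolver names are made up, and the chain would need to be merged with whatever resolvers the build already defines.

```xml
<ivysettings>
  <settings defaultResolver="default-chain" />
  <resolvers>
    <chain name="default-chain">
      <!-- regular Maven Central lookup first -->
      <ibiblio name="maven2" m2compatible="true" />
      <!-- then the Apache snapshots group, which hosts the 0.21 poms -->
      <ibiblio name="apache-snapshots" m2compatible="true"
               root="https://repository.apache.org/content/groups/snapshots/" />
    </chain>
  </resolvers>
</ivysettings>
```

The key detail is m2compatible="true", which makes the ibiblio resolver translate Ivy org/name/rev coordinates into Maven-style repository paths.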

On Wednesday 14 December 2011 16:05:07 Markus Jelsma wrote:
> Andrzej,
> 
> I cannot continue with testing migration on 0.20 because things like
> MapFileOutputFormat are missing in the new API. I cannot compile with 0.22
> because it no longer has the old mapred API. And i cannot build with 0.21
> because it is not  in maven central!?
> 
> Any help?
> 
> Thanks!
> 
> On Tuesday 13 December 2011 18:57:48 Andrzej Bialecki wrote:
> > On 13/12/2011 18:04, Markus Jelsma wrote:
> > > Hi
> > > 
> > > I did a quick test to see what happens and it won't compile. It cannot
> > > find our old mapred API's in 0.22. I've also tried 0.20.205.0 which
> > > compiles but won't run and many tests fail with stuff like.
> > > 
> > > Exception in thread "main" java.lang.NoClassDefFoundError:
> > > org/codehaus/jackson/map/JsonMappingException
> > > 
> > >          at org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)
> > 
> > Hmm... what's that? I don't see this class (or this package) in the
> > Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.
> > 
> > >          at org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
> > >          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >          at org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
> > > Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
> > > 
> > >          at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> > >          at java.security.AccessController.doPrivileged(Native Method)
> > >          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> > >          at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> > >          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> > >          at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> > >          ... 4 more
> > > 
> > > I think this can be overcome but we cannot hide from the fact that all
> > > jobs must be ported to the new API at some point.
> > > 
> > > You did some work on the new API's, did you come across any cumbersome
> > > issues when working on it?
> > 
> > It was quite some time ago .. but I don't remember anything being really
> > complicated, it was just tedious - and once you've done one class the
> > other classes follow roughly the same pattern.

-- 
Markus Jelsma - CTO - Openindex

Re: Upgrading to Hadoop 0.22.0+

Posted by Markus Jelsma <ma...@openindex.io>.
Andrzej,

I cannot continue testing the migration on 0.20 because things like 
MapFileOutputFormat are missing from the new API. I cannot compile with 0.22 
because it no longer has the old mapred API. And I cannot build with 0.21 
because it is not in Maven Central!?

Any help?

Thanks!


On Tuesday 13 December 2011 18:57:48 Andrzej Bialecki wrote:
> On 13/12/2011 18:04, Markus Jelsma wrote:
> > Hi
> > 
> > I did a quick test to see what happens and it won't compile. It cannot
> > find our old mapred API's in 0.22. I've also tried 0.20.205.0 which
> > compiles but won't run and many tests fail with stuff like.
> > 
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/codehaus/jackson/map/JsonMappingException
> > 
> >          at org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)
> 
> Hmm... what's that? I don't see this class (or this package) in the
> Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.
> 
> >          at org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
> >          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >          at org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
> > Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
> > 
> >          at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> >          at java.security.AccessController.doPrivileged(Native Method)
> >          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> >          at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> >          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> >          at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> >          ... 4 more
> > 
> > I think this can be overcome but we cannot hide from the fact that all
> > jobs must be ported to the new API at some point.
> > 
> > You did some work on the new API's, did you come across any cumbersome
> > issues when working on it?
> 
> It was quite some time ago .. but I don't remember anything being really
> complicated, it was just tedious - and once you've done one class the
> other classes follow roughly the same pattern.

-- 
Markus Jelsma - CTO - Openindex

Re: Upgrading to Hadoop 0.22.0+

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

I added Jackson as a dependency and can now build Nutch with Hadoop 
0.20.205.0; Hadoop itself needs it. Should we commit this? I'd prefer migrating to 
that version before doing all the API migrations.

Nutch runs fine with 0.20.205.0 locally, and also on a 0.20.203 cluster.
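Declaring Jackson in Nutch's ivy/ivy.xml would look roughly like the fragment below. The revision number is an assumption and should be matched against the jackson version listed in the hadoop-core 0.20.205.0 pom rather than taken from here.

```xml
<!-- Jackson is a runtime dependency of Hadoop 0.20.205.0 itself.
     rev is a placeholder: check the hadoop-core pom for the exact version. -->
<dependency org="org.codehaus.jackson" name="jackson-core-asl"
            rev="1.0.1" conf="*->default" />
<dependency org="org.codehaus.jackson" name="jackson-mapper-asl"
            rev="1.0.1" conf="*->default" />
```

Both artifacts are needed: org.codehaus.jackson.map.JsonMappingException from the stack trace lives in jackson-mapper-asl, which in turn depends on jackson-core-asl.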

Thanks

On Tuesday 13 December 2011 18:57:48 Andrzej Bialecki wrote:
> On 13/12/2011 18:04, Markus Jelsma wrote:
> > Hi
> > 
> > I did a quick test to see what happens and it won't compile. It cannot
> > find our old mapred API's in 0.22. I've also tried 0.20.205.0 which
> > compiles but won't run and many tests fail with stuff like.
> > 
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/codehaus/jackson/map/JsonMappingException
> > 
> >          at org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)
> 
> Hmm... what's that? I don't see this class (or this package) in the
> Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.
> 
> >          at org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
> >          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >          at org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
> > Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
> > 
> >          at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> >          at java.security.AccessController.doPrivileged(Native Method)
> >          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> >          at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> >          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> >          at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> >          ... 4 more
> > 
> > I think this can be overcome but we cannot hide from the fact that all
> > jobs must be ported to the new API at some point.
> > 
> > You did some work on the new API's, did you come across any cumbersome
> > issues when working on it?
> 
> It was quite some time ago .. but I don't remember anything being really
> complicated, it was just tedious - and once you've done one class the
> other classes follow roughly the same pattern.

-- 
Markus Jelsma - CTO - Openindex

Re: Upgrading to Hadoop 0.22.0+

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 13/12/2011 18:04, Markus Jelsma wrote:
> Hi
>
> I did a quick test to see what happens and it won't compile. It cannot find
> our old mapred API's in 0.22. I've also tried 0.20.205.0 which compiles but
> won't run and many tests fail with stuff like.
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/codehaus/jackson/map/JsonMappingException
>          at org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)

Hmm... what's that? I don't see this class (or this package) in the 
Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.

>          at org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
>          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>          at org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
> Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
>          at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>          at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>          at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>          ... 4 more
>
> I think this can be overcome but we cannot hide from the fact that all jobs
> must be ported to the new API at some point.
>
> You did some work on the new API's, did you come across any cumbersome issues
> when working on it?

It was quite some time ago... but I don't remember anything being really 
complicated; it was just tedious. And once you've done one class, the 
other classes follow roughly the same pattern.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Upgrading to Hadoop 0.22.0+

Posted by Markus Jelsma <ma...@openindex.io>.
Hi

I did a quick test to see what happens, and it won't compile: it cannot find 
our old mapred APIs in 0.22. I've also tried 0.20.205.0, which compiles but 
won't run, and many tests fail with errors like:

Exception in thread "main" java.lang.NoClassDefFoundError: 
org/codehaus/jackson/map/JsonMappingException
        at org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)
        at org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 4 more

I think this can be overcome, but we cannot hide from the fact that all jobs 
must be ported to the new API at some point.

You did some work on the new APIs; did you come across any cumbersome issues 
while working on them?

Cheers


On Tuesday 13 December 2011 17:48:32 Andrzej Bialecki wrote:
> On 13/12/2011 17:42, Lewis John Mcgibbney wrote:
> > Hi Markus,
> > 
> > I'm certainly in agreement here. If you like to open a Jira, we can
> > begin the build up a picture of what is required.
> > 
> > Lewis
> > 
> > On Tue, Dec 13, 2011 at 4:41 PM, Markus Jelsma
> > 
> > <ma...@openindex.io>  wrote:
> >> Hi,
> >> 
> >> To keep up with the rest of the world i believe we should move from the
> >> old Hadoop mapred API to the new MapReduce API, which has already been
> >> done for the nutchgora branch. Upgrading from hadoop-core to
> >> hadoop-common is easily done in Ivy but all jobs must be tackled and we
> >> have many jobs!
> >> 
> >> Anyone to give pointers and helping hand in this large task?
> 
> I guess the question is also whether the 0.22 is compatible enough to
> compile more or less with the existing code that uses the old api. If it
> does, then we can do the transition gradually, if it doesn't then it's a
> bigger issue.
> 
> This is easy to verify - just drop in the 0.22 jars and see if it
> compiles / tests are passing.

-- 
Markus Jelsma - CTO - Openindex

Re: Upgrading to Hadoop 0.22.0+

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 13/12/2011 17:42, Lewis John Mcgibbney wrote:
> Hi Markus,
>
> I'm certainly in agreement here. If you like to open a Jira, we can
> begin the build up a picture of what is required.
>
> Lewis
>
> On Tue, Dec 13, 2011 at 4:41 PM, Markus Jelsma
> <ma...@openindex.io>  wrote:
>> Hi,
>>
>> To keep up with the rest of the world i believe we should move from the old
>> Hadoop mapred API to the new MapReduce API, which has already been done for
>> the nutchgora branch. Upgrading from hadoop-core to hadoop-common is easily
>> done in Ivy but all jobs must be tackled and we have many jobs!
>>
>> Anyone to give pointers and helping hand in this large task?

I guess the question is also whether 0.22 is compatible enough to 
compile, more or less, with the existing code that uses the old API. If it 
does, then we can make the transition gradually; if it doesn't, then it's a 
bigger issue.

This is easy to verify: just drop in the 0.22 jars and see if it 
compiles and the tests pass.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Upgrading to Hadoop 0.22.0+

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Markus,

I'm certainly in agreement here. If you'd like to open a Jira, we can
begin to build up a picture of what is required.

Lewis

On Tue, Dec 13, 2011 at 4:41 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Hi,
>
> To keep up with the rest of the world i believe we should move from the old
> Hadoop mapred API to the new MapReduce API, which has already been done for
> the nutchgora branch. Upgrading from hadoop-core to hadoop-common is easily
> done in Ivy but all jobs must be tackled and we have many jobs!
>
> Anyone to give pointers and helping hand in this large task?
>
> Cheers,
>
> --
> Markus Jelsma - CTO - Openindex



-- 
Lewis