Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2016/11/02 16:10:33 UTC

Nutch 1.x on hadoop

I'm having trouble trying to get Nutch 1.12 to run on hadoop 2.7.3.
I get a class not found exception for org.apache.nutch.crawl.Crawl, as in the following attempt.
$HADOOP_HOME/bin/hadoop jar "/home/mjc/apache-nutch-1.12/runtime/deploy/apache-nutch-1.12.job" org.apache.nutch.crawl.Crawl seed -dir seed -depth 1 -topN 5
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

Searching the web, I see that things seem to have changed in recent versions of Nutch. However, I have not been able to find a good tutorial or step-by-step guide for getting this to work. I would appreciate any advice you could give. Is there documentation somewhere? Should I be using an older version?


Re: Nutch 1.x on hadoop

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
That makes a lot of sense. I had a problem with the tracking UI that I had to solve by disabling IPv6 on my machine. Now it is working better!



Re: Nutch 1.x on hadoop

Posted by Julien Nioche <li...@gmail.com>.
Hi Michael,

You can click on the logs for the fetch tasks to see the URLs being fetched.

J.





-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
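
For a terminal view instead of clicking through the web UI, one option is to pull the aggregated task logs with the yarn logs command and filter them for the fetcher's progress lines. This is only a sketch under assumptions: log aggregation (yarn.log-aggregation-enable) has to be turned on, the application ID in the usage comment is a placeholder, and the filter assumes each fetched URL is logged on a line containing "fetching <url>", as seen on the terminal in local mode. The function name fetched_urls is made up for illustration.

```shell
#!/bin/sh
# Print the URL from every input line that contains "fetching <url>".
# The "fetching " prefix is an assumption about the fetcher's log format.
fetched_urls() {
  sed -n 's/.*fetching \([^ ]*\).*/\1/p'
}

# Usage against a real cluster (the application ID is a placeholder):
#   yarn logs -applicationId application_1478100000000_0001 | fetched_urls

# Self-contained demonstration on two made-up log lines:
printf '%s\n' \
  '2016-11-03 02:10:01,123 INFO fetcher.FetcherThread - fetching http://example.com/ (queue crawl delay=5000ms)' \
  '2016-11-03 02:10:01,456 INFO fetcher.Fetcher - -activeThreads=10' |
  fetched_urls
```

Only the first sample line matches, so the demonstration prints a single URL. While a job is still running, the per-task logs in the web UI remain the simpler route.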

Re: Nutch 1.x on hadoop

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
Thanks, that was very helpful!
Another newbie question: when I run Nutch standalone, I can see what it's trying to fetch (in my terminal) as it goes along. How can I watch what it's doing when it runs under Hadoop? I have clicked around a little in the Hadoop monitoring web app, but haven't found it yet.



Re: Nutch 1.x on hadoop

Posted by Julien Nioche <li...@gmail.com>.
Michael,

See
http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html
for a relatively recent step-by-step tutorial for Nutch 1.x

Julien




Re: Nutch 1.x on hadoop

Posted by Divjot Singh <di...@gmail.com>.
Hi

I have used Nutch 2.3, so I don't know whether this will help with 1.x. In the
deploy folder there is a crawl script in the bin folder:

    runtime/deploy/bin/crawl /tmp/seed.txt group_a 1000

The seed.txt file should be copied to HDFS.

Thanks
Divjot
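
The steps above can be sketched for Nutch 1.12 as follows. As far as I know, the org.apache.nutch.crawl.Crawl class was removed from recent 1.x releases in favour of the bin/crawl script, which would explain the ClassNotFoundException in the original command. Everything concrete here is an assumption: the HDFS directory names urls and crawl, the install path, and the round count are placeholders, and the argument order (seed dir, crawl dir, number of rounds) is what I believe the 1.12 script expects. It differs from the 2.x form shown above, so run bin/crawl with no arguments to see the usage message for your version. The script prints the commands by default and only executes them when RUN=1 is set.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands unless RUN=1 is set in the environment.
# All paths and names below are placeholders, not values confirmed by the thread.
NUTCH_DEPLOY="${NUTCH_DEPLOY:-$HOME/apache-nutch-1.12/runtime/deploy}"
SEED_FILE="${SEED_FILE:-seed.txt}"   # local file, one URL per line
SEED_DIR="${SEED_DIR:-urls}"         # HDFS directory that will hold the seed list
CRAWL_DIR="${CRAWL_DIR:-crawl}"      # HDFS directory for the crawl data
ROUNDS="${ROUNDS:-1}"                # number of crawl rounds

# Execute the command when RUN=1, otherwise just print it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

crawl_on_hadoop() {
  # 1. Copy the seed list into HDFS.
  run hdfs dfs -mkdir -p "$SEED_DIR"
  run hdfs dfs -put -f "$SEED_FILE" "$SEED_DIR/"
  # 2. Launch the crawl script from the deploy directory
  #    (the hadoop executable needs to be on the PATH).
  run "$NUTCH_DEPLOY/bin/crawl" "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS"
}

crawl_on_hadoop
```

Running it as-is prints the three commands so they can be checked before anything touches the cluster.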
