Posted to user@nutch.apache.org by Brian Griffey <bg...@shopsavvy.mobi> on 2011/06/03 23:27:09 UTC

Nutch not crawling on a pre-existing hadoop cluster?

Hi all,

I recently downloaded Nutch onto my local machine. I wrote a few plugins for it and successfully crawled a few sites to make sure that my parsers and indexers worked well. I then moved the Nutch installation onto our pre-existing Hadoop cluster by copying the needed libs, confs, and the build/plugins dir onto every machine in the cluster, and I adjusted nutch-site.xml to point the plugin loader at the hard-coded path where the plugins sit. The Nutch system runs without errors, but it never gets past a few pages: it seems to get stuck grabbing only one page per level, and it fetches that same page on every pass. I have included the interesting files and system logs in the attachment for easy viewing. Does anyone have ideas on why it isn't going forward? It also seems to abort threads.
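For concreteness, the override in my nutch-site.xml looks roughly like this (the value shown here is an illustrative path, not my actual one):

  <property>
    <name>plugin.folders</name>
    <value>/opt/nutch/build/plugins</value>
  </property>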

2011-06-03 16:20:51,559 WARN org.apache.nutch.parse.ParserFactory: ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2011-06-03 16:20:51,629 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
2011-06-03 16:20:51,629 WARN org.apache.nutch.fetcher.Fetcher: Aborting with 10 hung threads.

-- 
Brian Griffey
ShopSavvy Android and Big Data Developer
650-352-1429


Re: Nutch not crawling on a pre-existing hadoop cluster?

Posted by Julien Nioche <li...@gmail.com>.
Hi Brian,

It would be easier to simply generate a job file and use the script in bin to
run the tasks. Hand-copying the plugins and jars onto each machine is not
practical. The reason we separated the jars+plugins approach from the job
approach in the 1.3 runtimes was to avoid possible conflicts.
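Roughly like this (a sketch assuming a Nutch 1.3 source checkout and a working
Hadoop client on the machine you submit from; the urls dir and crawl options
are illustrative):

  # in the Nutch source tree: build runtime/local and runtime/deploy,
  # including the self-contained .job file (jars + plugins bundled inside)
  ant runtime

  # submit from runtime/deploy; bin/nutch detects the .job file and
  # runs it on the cluster via "hadoop jar"
  cd runtime/deploy
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000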

Julien



> I recently downloaded Nutch onto my local machine. I wrote a few plugins
> for it and successfully crawled a few sites to make sure that my parsers
> and indexers worked well. I then moved the Nutch installation onto our
> pre-existing Hadoop cluster by copying the needed libs, confs, and the
> build/plugins dir onto every machine in the cluster, and I adjusted
> nutch-site.xml to point the plugin loader at the hard-coded path where the
> plugins sit. The Nutch system runs without errors, but it never gets past
> a few pages: it seems to get stuck grabbing only one page per level, and
> it fetches that same page on every pass. I have included the interesting
> files and system logs in the attachment for easy viewing. Does anyone have
> ideas on why it isn't going forward? It also seems to abort threads.
>
> 2011-06-03 16:20:51,559 WARN org.apache.nutch.parse.ParserFactory: ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
> 2011-06-03 16:20:51,629 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> 2011-06-03 16:20:51,629 WARN org.apache.nutch.fetcher.Fetcher: Aborting with 10 hung threads.
>
>
> --
> Brian Griffey
> ShopSavvy Android and Big Data Developer
> 650-352-1429
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Nutch not crawling on a pre-existing hadoop cluster?

Posted by MilleBii <mi...@gmail.com>.
The "Aborting" message does not look wrong by itself; the fetcher always does that at the end of a fetch cycle.

Do you use the one-stop crawl command or the step-by-step commands? In the
latter case you have a better chance of seeing where it fails.
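Roughly, one step-by-step cycle looks like this (a sketch; paths are
illustrative, and in deploy mode the dirs live on HDFS):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # pick up the segment that generate just created
  s=`hadoop fs -ls crawl/segments | tail -1 | awk '{print $NF}'`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s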

We don't get attachments on this mailing list.

2011/6/3 Brian Griffey <bg...@shopsavvy.mobi>

>  Hi all,
>
> I recently downloaded Nutch onto my local machine. I wrote a few plugins
> for it and successfully crawled a few sites to make sure that my parsers
> and indexers worked well. I then moved the Nutch installation onto our
> pre-existing Hadoop cluster by copying the needed libs, confs, and the
> build/plugins dir onto every machine in the cluster, and I adjusted
> nutch-site.xml to point the plugin loader at the hard-coded path where the
> plugins sit. The Nutch system runs without errors, but it never gets past
> a few pages: it seems to get stuck grabbing only one page per level, and
> it fetches that same page on every pass. I have included the interesting
> files and system logs in the attachment for easy viewing. Does anyone have
> ideas on why it isn't going forward? It also seems to abort threads.
>
> 2011-06-03 16:20:51,559 WARN org.apache.nutch.parse.ParserFactory: ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
> 2011-06-03 16:20:51,629 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> 2011-06-03 16:20:51,629 WARN org.apache.nutch.fetcher.Fetcher: Aborting with 10 hung threads.
>
>
> --
> Brian Griffey
> ShopSavvy Android and Big Data Developer
> 650-352-1429
>
>


-- 
-MilleBii-