Posted to user@nutch.apache.org by Mike Pountney <Mi...@semantico.com> on 2012/08/07 13:50:42 UTC

SOLR Indexing issue, possibly due to NUTCH-1084?

Hi there,

I have an issue with our Nutch 1.4 deployment, whereby a page that has been successfully crawled (readdb -dump gives db_fetched status) is not being indexed into SOLR.

Trying to retrieve the content using:

nutch readdb $crawldb -url $url

gives:

java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus

... which appears to be a known bug as per NUTCH-1084.

Could this be the reason why the content is not being indexed? Does 'nutch solrindex' iterate through the pages using the same codebase that is failing?

Is there any workaround for the NUTCH-1084 issue? It's occurring on about 10% of the pages we've crawled; the rest are fine (and appear to be indexed).

Incidentally, we're running this under the Hadoop 0.20 task/jobtracker, on a single node with no HDFS usage.

Any help is greatly appreciated.

Mike