Posted to agent@nutch.apache.org by chandra shekher gupta <ch...@infoaxon.com> on 2007/12/26 09:40:10 UTC

How to Crawl CMS System

Hi Everybody,

We are planning to use the Nutch 0.9 web crawler. It works fine with any
website that serves static content: it crawls the site and creates the
binary DB. We also have a CMS whose content is stored as objects in a
database; that is, both the content file name and the content itself reside
in the database. When we crawl this CMS, we get the following error:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

I presume that it is not able to get the index file for the root directory,
because in the CMS there is no index file or any other content file.
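
If that presumption is right, a possible workaround would be to give the
crawler one seed URL per content item instead of relying on a root index
page. The sketch below only illustrates the idea. It assumes that every
content item is reachable over HTTP at a URL built from its name; the base
URL, the content names and the class and method names are all hypothetical,
and in practice the names would be read from the CMS database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SeedListSketch {

    // Builds one seed URL per content item so the crawl does not depend on
    // a root index page. baseUrl and contentNames are hypothetical inputs.
    static List<String> buildSeedUrls(String baseUrl, List<String> contentNames) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        List<String> seeds = new ArrayList<String>();
        for (String name : contentNames) {
            // Skip missing or blank names rather than emitting a bare base URL.
            if (name == null || name.trim().isEmpty()) {
                continue;
            }
            seeds.add(base + name.trim());
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Placeholder names; a real list would come from a query on the CMS DB.
        List<String> names = Arrays.asList("news/item-1", "docs/manual");
        for (String url : buildSeedUrls("http://cms.example.com/content", names)) {
            System.out.println(url);
        }
    }
}
```

The printed lines could be saved as the seed list that the crawl starts
from. This only helps if the CMS actually serves each item at such a URL.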

Lucene works fine with the CMS because, each time content is created or
updated, the Lucene indexer generates the binary and updates its binary DB
and indexes.

Now my questions are:

How will the Nutch web crawler pick up the content from the DB?
Do we need to write our own indexer that will update the crawl binary?

Please share your ideas.
Kind regards,
Chandra
Infoaxon Technology