Posted to user@nutch.apache.org by Andy Morris <an...@woodward.edu> on 2006/02/03 03:50:15 UTC

Some guidance please

Thanks Chris...

Here is my situation...
I want to crawl just a local site for our intranet.  We have just
moved from a pure HTML website to an ASP-only one.  I ran Nutch on
the old site and got great results.  Since moving to the new site I am
having a devil of a time retrieving good information, and I am missing
a ton of content altogether.  I am not sure what settings I need to
change to get good results.  One change I have made does produce good
results, but it seems to crawl other websites and not just my domain:
in the last line of the crawl-urlfilter file I replaced the - with a +
so that it does not ignore other information.  Our site is
www.woodward.edu.  I was wondering if someone on this list could crawl
this site, and only this domain (woodward.edu), and see what they come
up with.  I am just stumped as to what to do next.  I am running a
nightly build from January 26th, 2006.
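
To illustrate why flipping that last line lets the crawl leave the
domain, here is a minimal Python sketch of a first-match-wins URL
filter.  It is not Nutch code: the function name, the rule lists, and
the woodward.edu pattern are assumptions made up for illustration, and
it assumes the filter takes the first rule whose pattern is found in
the URL and rejects a URL that matches no rule.

```python
import re

def first_match_filter(rules, url):
    """Return True if url is accepted.

    The first rule whose pattern is found in the URL decides the
    outcome; a URL that matches no rule is rejected (assumed behavior).
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

# Hypothetical rule set: accept woodward.edu hosts, skip everything else.
DOMAIN_ONLY = [
    ("+", r"^http://([a-z0-9]*\.)*woodward\.edu/"),
    ("-", r"."),
]

# Same rules with the final "-" flipped to "+": everything is accepted.
ACCEPT_ALL_LAST = [
    ("+", r"^http://([a-z0-9]*\.)*woodward\.edu/"),
    ("+", r"."),
]

if __name__ == "__main__":
    for url in ("http://www.woodward.edu/page.asp", "http://example.com/"):
        print(url,
              first_match_filter(DOMAIN_ONLY, url),
              first_match_filter(ACCEPT_ALL_LAST, url))
```

With the final rule left as a reject, only URLs on the domain get
through; with it flipped to an accept, any URL that reaches the last
line is accepted, which would explain the crawl wandering off-site.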

My criteria for our local search is to be able to search PDF, images,
doc, and web content.  You can go here and see what the search page
pulls up http://search.woodward.edu .

Thanks for any help this list can provide.
Andy Morris 

-----Original Message-----
From: Chris Mattmann [mailto:chris.mattmann@jpl.nasa.gov] 
Sent: Thursday, February 02, 2006 7:59 PM
To: nutch-user@lucene.apache.org
Subject: RE: Xml?

Hi Andy,

> What is this error from?

Wow, super cool! You're the first post I've seen to the list regarding
these log messages that I put in :-) For that matter, they're log
warnings, not errors really:

> 060202 141539 ParserFactory:Plugin: parse-text mapped to contentType
> text/xml via parse-plugins.xml, but its plugin.xml file does not claim
> to support contentType: text/xml

This one says that you have the parse-text plugin mapped to the
contentType "text/xml" in the parse-plugins.xml file. However, this is
kind of weird because the plugin.xml file for the parse-text plugin does
not claim to support "text/xml". So, it's just a warning.

> 060202 141539 ParserFactory:Plugin: parse-html mapped to contentType
> text/xml via parse-plugins.xml, but its plugin.xml file does not claim
> to support contentType: text/xml

Same issue here.

> 060202 141539 ParserFactory: Plugin: parse-rss mapped to contentType
> text/xml via parse-plugins.xml, but not enabled via plugin.includes in
> nutch-default.xml

This is another cool one (in my opinion :-) ). It says that you went
ahead and mapped parse-rss to the contentType "text/xml" in
parse-plugins.xml, but you didn't enable parse-rss in the
plugin.includes property in nutch-default.xml or nutch-site.xml.
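
As a rough illustration of that last check, here is a small Python
sketch.  It is not Nutch code: it assumes plugin.includes holds a
regular expression that a plugin id must match in full to be enabled,
and the PLUGIN_INCLUDES value and the is_enabled helper are made up
for the example rather than copied from any real configuration.

```python
import re

# Assumed, illustrative plugin.includes value (not from a real config).
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(text|html)"
    r"|index-basic|query-(basic|site|url)"
)

def is_enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if plugin_id matches the includes regex in full."""
    return re.fullmatch(includes, plugin_id) is not None

if __name__ == "__main__":
    for plugin_id in ("parse-text", "parse-html", "parse-rss"):
        print(plugin_id, is_enabled(plugin_id))
```

Under those assumptions parse-rss is reported as not enabled until the
expression is widened, for example to parse-(text|html|rss).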

Does that make sense?

Cheers,
  Chris

> 
> Andy