Posted to user@nutch.apache.org by Fred Zimmerman <zi...@gmail.com> on 2011/10/26 16:37:14 UTC

1) success 2) how to tell Nutch "index everything"

1) I resolved the issues with solrindex. It turned out to be a matter of
adding all the Nutch schema-specific fields to Solr's schema.xml. There was
one gotcha: the latest Solr schema does not have a default fieldType named
"text" as in the Nutch 1.3 schema.xml; you must use "text_general". A
comment for developers: copying the Nutch schema over the Solr one only
works for people who are beginning their indexing with a crawl. More
detailed instructions on how to modify Solr's schema.xml for Nutch would be
helpful, or better yet, a script to add the appropriate fields.
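
For illustration only, a script of the kind asked for above could look
roughly like the sketch below. It is not something shipped with Nutch or
Solr. It assumes both schema files keep their <field> elements under a
<fields> element, and that remapping the type "text" to "text_general" is
the only change needed; the function name merge_nutch_fields and the
type_map parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def merge_nutch_fields(solr_schema_xml, nutch_schema_xml, type_map=None):
    """Return Solr schema XML with any missing Nutch fields appended.

    Assumption: both schemas hold their <field> elements under <fields>.
    """
    # Assumed mapping, taken from the gotcha described in this thread.
    type_map = type_map or {"text": "text_general"}

    solr_root = ET.fromstring(solr_schema_xml)
    nutch_root = ET.fromstring(nutch_schema_xml)

    solr_fields = solr_root.find("fields")
    if solr_fields is None:
        raise ValueError("no <fields> element found in the Solr schema")

    existing = {f.get("name") for f in solr_fields.findall("field")}

    for field in nutch_root.iter("field"):
        name = field.get("name")
        if name in existing:
            continue  # keep the definition Solr already has
        attrs = dict(field.attrib)
        if "type" in attrs:
            # Rename the field type where the Solr schema uses another name.
            attrs["type"] = type_map.get(attrs["type"], attrs["type"])
        ET.SubElement(solr_fields, "field", attrs)
        existing.add(name)

    return ET.tostring(solr_root, encoding="unicode")
```

Note that the standard-library parser drops XML comments, so the output
loses any comments from the Solr schema; keep a backup and review the
result before using it.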

2) Is there a way to tell Nutch to index everything at a given site? I am
crawling a couple of my own sites and it seems rather clumsy just to give
Nutch a big "TopN". Wouldn't an "all" value be helpful?

Re: 1) success 2) how to tell Nutch "index everything"

Posted by Markus Jelsma <ma...@openindex.io>.

On Wednesday 26 October 2011 16:37:14 Fred Zimmerman wrote:
> 1) I resolved the issues with solrindex. It turned out to be a matter of
> adding all the nutch schema-specific fields to solr's schema.xml.  there
> was one gotcha which is that the latest solr schema does not have a
> default fieldtype "text" as in Nutch 1.3/schema.xml; you must use
> "text_general".

You're free to use any naming convention you want in Solr. We ship a
complete, working Solr schema. The fieldType's name doesn't really matter.
We do not intend to ship an advanced schema; developers must make changes
that are appropriate for their specific environment, use cases and scenarios.
 
> A comment for developers is that the use case of copying
> the nutch schema to overwrite the solr one only works for people who are
> beginning their indexing with a crawl.  More detailed instructions on how
> to modify solr/schema.xml for nutch would be helpful, or better yet, a
> script to add the appropriate fields.

The Solr schema provided with Nutch tells you exactly which fields are used.
Detailed instructions on how to make it work with Solr are out of scope, in
my opinion.
You're of course free to make changes to the wiki :)

> 
> 2) is there a way to tell Nutch to index everything at a given site?  I am
> crawling a couple of my own sites and it seems rather clumsy just to give
> Nutch a big "TopN."  wouldn't an "all" value be helpful?

The only way to do this is to keep running crawl cycles until all existing
and to-be-discovered URLs are exhausted. After that, nothing more is
generated until the fetch interval tells the generator to refetch.
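
As a rough sketch of that loop (not an official Nutch script), the cycle
can be repeated until the generate step stops producing a new segment. The
bin/nutch path, the crawldb and segments locations, the helper names, and
the max_cycles cap are all assumptions. No -topN is passed, on the
assumption that the generator then selects every URL that is due.

```python
import os
import subprocess


def list_segments(segments_dir):
    """Return the set of segment directory names currently present."""
    if not os.path.isdir(segments_dir):
        return set()
    return set(os.listdir(segments_dir))


def run_nutch(*args):
    """Run one bin/nutch command; the relative path is an assumption."""
    return subprocess.call(["bin/nutch", *args])


def crawl_until_exhausted(crawldb, segments_dir, run=run_nutch,
                          segments=list_segments, max_cycles=50):
    """Repeat generate/fetch/parse/updatedb until generate adds no segment.

    Returns the number of completed cycles.
    """
    for cycle in range(max_cycles):
        before = segments(segments_dir)
        # The exit code of generate is ignored on purpose; exhaustion is
        # detected by checking whether a new segment directory appeared.
        run("generate", crawldb, segments_dir)
        new = sorted(segments(segments_dir) - before)
        if not new:
            return cycle  # nothing left to fetch for now
        segment = os.path.join(segments_dir, new[-1])
        for step in (("fetch", segment),
                     ("parse", segment),
                     ("updatedb", crawldb, segment)):
            if run(*step) != 0:
                raise RuntimeError("step failed: " + " ".join(step))
    return max_cycles  # safety cap reached before the crawl was exhausted
```

Calling crawl_until_exhausted("crawl/crawldb", "crawl/segments") from the
Nutch directory would run the loop with those assumed paths; indexing with
solrindex would still be a separate step afterwards.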

Cheers


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350