Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2012/03/12 14:32:57 UTC

Hostnames changed for lots of URLS in crawldb, solr index, how to change?

How would one go about changing the hostnames that a large number of URLs
point to, in both the crawldb and the Solr index? I added a regular
expression to regex-normalize.xml and ran updatedb with the -normalize
switch on. Then I ran the solrindex command, but nothing seemed to change
in my search results.


Re: Hostnames changed for lots of URLS in crawldb, solr index, how to change?

Posted by Markus Jelsma <ma...@openindex.io>.
See NUTCH-1300, "Indexer to normalize URL's":
https://issues.apache.org/jira/browse/NUTCH-1300

This will _not_ update existing documents! You have to reindex all segments 
with normalizing enabled.
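
To make the effect of such a normalizer rule concrete, here is a rough
sketch in Python of what a host-rewriting regular expression does to a
URL. This is not Nutch code and not the poster's actual rule: the host
names old.example.com and new.example.com and the function name
normalize are made up for illustration. In regex-normalize.xml the same
idea would be written as a pattern/substitution pair using Java regex
syntax.

```python
import re

# Hypothetical host names, used only for illustration.
OLD_HOST = "old.example.com"
NEW_HOST = "new.example.com"

# Match the scheme plus the old host, but only when the host is followed
# by a port, a path, a query, a fragment, or the end of the URL. The
# lookahead keeps hosts such as old.example.com.other.org untouched.
HOST_RULE = re.compile(
    r"^(https?://)" + re.escape(OLD_HOST) + r"(?=[:/?#]|$)",
    re.IGNORECASE,
)


def normalize(url: str) -> str:
    """Rewrite the old host to the new one; leave other URLs unchanged."""
    return HOST_RULE.sub(lambda m: m.group(1) + NEW_HOST, url, count=1)


if __name__ == "__main__":
    for url in (
        "http://old.example.com/docs/page.html",
        "https://old.example.com:8080/search?q=nutch",
        "http://other.example.com/docs/page.html",
    ):
        print(url, "->", normalize(url))
```

Documents already in the index still carry the old URL, which is why the
rule alone changes nothing in search until the segments are reindexed
with normalizing enabled.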

On Monday 12 March 2012 14:32:57 webdev1977 wrote:
> How would one go about changing the hostnames that a large number of urls
> point to in both the crawldb as well as the solr index?  I tried running
> the updatedb with the -normalize switch on. I added a regular expression
> in regex-normalize.xml. Then I ran the solrindex command, but nothing
> seemed to change in my search?

-- 
Markus Jelsma - CTO - Openindex