Posted to dev@nutch.apache.org by Vlad Paunescu <vl...@gmail.com> on 2012/05/30 14:39:08 UTC

Using Nutch for Web Site Mirroring

Hello,

I am currently trying to use Nutch as a web site mirroring tool. To be more
explicit, I only need to download the pages, not index them (I do not intend
to use Nutch as a search engine). I couldn't figure out a simpler way to
accomplish this, so what I do now is:

- crawl the site, starting from its URL;
- merge the resulting segments;
- dump the merged segment (readseg -dump) and take the page content from
  there (roughly the commands sketched below).
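For reference, the commands I run look roughly like this (Nutch 1.x command
line; directory names and crawl parameters are just placeholders):

    # crawl the site, starting from the seed URL(s) listed under urls/
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

    # merge all fetched segments into a single one
    bin/nutch mergesegs crawl/merged -dir crawl/segments

    # dump the merged segment so I can get at the raw content
    # (the merged segment ends up in a timestamped directory under crawl/merged)
    bin/nutch readseg -dump crawl/merged/<merged-segment> dump \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext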

However, I haven't managed to configure Nutch to rewrite absolute links as
local ones (e.g. href="http://www.example.com/dir/pag.html" should become
href="dir/pag.html"). I found URLNormalizer, but I don't understand what it
does: whether it only rewrites the URL of the page being crawled, or whether
it also scans the content of that page and modifies the href and src
attributes.
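To make the rewrite I have in mind concrete, on the dumped HTML it would be
something as simple as the following (a naive regex sketch, with
www.example.com standing in for the mirrored host; I'm not suggesting this is
what Nutch does internally):

    // naive post-processing sketch: rewrite absolute links that point at the
    // mirrored host into site-relative ones, for both href and src attributes
    String rewritten = html.replaceAll(
        "(href|src)=\"http://www\\.example\\.com/", "$1=\"");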

I would also like to know whether Nutch can be configured to create a
directory tree containing all the pages it crawled. At the moment I only have
the dumped content, which has to be parsed by a Java program I am currently
writing in order to build a directory tree that matches the site's structure
(a rough sketch of what I mean follows below).
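For illustration, the core of that program is roughly the following (class
and method names are my own placeholders): it takes a page URL plus its
dumped content and writes the content under a local root that mirrors the
site's host and path structure.

    import java.io.IOException;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class MirrorWriter {

        // Write one dumped page under localRoot, mirroring the URL's host and path.
        public static void writePage(Path localRoot, String pageUrl, String content)
                throws IOException {
            URL url = new URL(pageUrl);
            String path = url.getPath();                  // e.g. "/dir/pag.html"
            if (path.isEmpty() || path.endsWith("/")) {
                path = path + "index.html";               // map directory URLs to index.html
            }
            if (path.startsWith("/")) {
                path = path.substring(1);                 // make it relative to localRoot
            }
            Path target = localRoot.resolve(url.getHost()).resolve(path);
            Files.createDirectories(target.getParent());  // build the directory tree
            Files.write(target, content.getBytes(StandardCharsets.UTF_8));
        }
    }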

Any help will be much appreciated! Thank you!
Vlad