Posted to user@nutch.apache.org by "vlad.paunescu" <vl...@gmail.com> on 2012/05/25 13:07:31 UTC

Using Nutch for Web Site Mirroring

Hello,

I am currently trying to use Nutch as a web site mirroring tool. To be more
specific, I only need to download the pages, not index them (I do not
intend to use it as a search engine). I couldn't figure out a simpler way to
accomplish my task, so what I do now is:

- crawl the site, starting from its URL;
- merge the resulting segments;
- read the merged segment (dump) so that it shows the page content.

However, I didn't manage to configure Nutch to change absolute links into
local links (e.g. href="http://www.example.com/dir/pag.html" should be
transformed into href="dir/pag.html"). I found URLNormalizer, but I don't
understand what it does: whether it only takes the URL of the page being
crawled and transforms it, or whether it scans the content of the page and
modifies the href or src attributes.
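
To make the transformation I am after concrete, here is a minimal Java
sketch of it as a post-processing step on the dumped HTML. It does not use
any Nutch API; the class name LinkLocalizer and the method localize are
made up for illustration, it only handles double-quoted href/src values,
and it produces paths relative to the site root (as in the example above),
not relative to the directory of the page that contains the link.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class LinkLocalizer {
    // Simplification: only double-quoted href="..." / src="..." values are matched.
    private static final Pattern LINK_ATTR =
        Pattern.compile("(href|src)=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Rewrites absolute links located under siteRoot into paths relative to siteRoot. */
    public static String localize(String html, String siteRoot) {
        URI root = URI.create(siteRoot); // assumed to end with "/"
        Matcher m = LINK_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String link = m.group(2);
            String rewritten = link;
            try {
                // relativize() returns its argument unchanged when the link
                // is not under root (other host, other scheme, relative link).
                rewritten = root.relativize(URI.create(link)).toString();
            } catch (IllegalArgumentException e) {
                // Malformed link: leave it untouched.
            }
            String attr = m.group(1) + "=\"" + rewritten + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(attr));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.example.com/dir/pag.html\">page</a>";
        // prints: <a href="dir/pag.html">page</a>
        System.out.println(localize(html, "http://www.example.com/"));
    }
}
```

A regular expression is only a rough stand-in for a real HTML parser here;
links in single quotes, in CSS or in scripts would be missed.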

I would also like to know whether Nutch can be configured to create a
directory tree with all the pages it crawled. Right now I only have the
dumped content, which has to be parsed by a Java program I am currently
writing in order to create a directory tree that matches the site's structure.
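
The directory-tree step could look roughly like the sketch below, assuming
the URL and raw content of each page have already been extracted from the
dump (that parsing is not shown). The names MirrorWriter, localPathFor and
save are hypothetical, query strings are ignored, and URLs ending in "/"
are mapped to index.html, which is a convention I picked rather than
anything Nutch prescribes.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch.
public class MirrorWriter {

    /** Maps http://host/dir/pag.html to mirrorRoot/host/dir/pag.html. */
    public static Path localPathFor(Path mirrorRoot, String url) {
        URI uri = URI.create(url);
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        if (path.isEmpty() || path.endsWith("/")) {
            path = path + "index.html"; // assumed name for directory URLs
        }
        if (path.startsWith("/")) {
            path = path.substring(1);
        }
        Path root = mirrorRoot.normalize();
        Path target = root.resolve(uri.getHost()).resolve(path).normalize();
        // Refuse paths that would escape the mirror directory via "..".
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("URL escapes mirror root: " + url);
        }
        return target;
    }

    /** Writes the page content to its place in the tree, creating parent directories. */
    public static Path save(Path mirrorRoot, String url, byte[] content) throws IOException {
        Path target = localPathFor(mirrorRoot, url);
        Files.createDirectories(target.getParent());
        return Files.write(target, content);
    }
}
```

Calling save(Paths.get("mirror"), "http://www.example.com/dir/pag.html", bytes)
would then create mirror/www.example.com/dir/pag.html.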

Any help will be much appreciated! Thank you!
Vlad


--
View this message in context: http://lucene.472066.n3.nabble.com/Using-Nutch-for-Web-Site-Mirroring-tp3986066.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Using Nutch for Web Site Mirroring

Posted by "vlad.paunescu" <vl...@gmail.com>.

Yes, I really do want to use Nutch, because Nutch has an API and wget doesn't.
I want to create a module that accepts site import jobs, i.e. it takes
requests to download web sites and processes them (an "import site"
module). I chose Nutch over writing my own crawler because it is open
source and solid (I would expect a lot of bugs if I wrote my own crawler).
If I chose wget, I would need to spawn a process for every site import
request, which is not what I want.
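
As a rough sketch of the module I have in mind: import jobs are queued and
run on worker threads inside one JVM instead of one external process per
request. Everything here is hypothetical, including the SiteCrawler
interface, which only marks the place where the crawler's API would be
called; it is not a Nutch class.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical module skeleton, not part of Nutch.
public class ImportSiteModule {

    /** Placeholder for whatever drives the crawler through its Java API. */
    public interface SiteCrawler {
        int crawl(String seedUrl) throws Exception; // returns number of pages fetched
    }

    private final SiteCrawler crawler;
    private final ExecutorService workers;

    public ImportSiteModule(SiteCrawler crawler, int threads) {
        this.crawler = crawler;
        this.workers = Executors.newFixedThreadPool(threads);
    }

    /** Queues one site import; the job runs on a worker thread in this JVM. */
    public Future<Integer> submit(final String seedUrl) {
        return workers.submit(new Callable<Integer>() {
            @Override
            public Integer call() throws Exception {
                return crawler.crawl(seedUrl);
            }
        });
    }

    /** Stops accepting new jobs; already queued jobs still run. */
    public void shutdown() {
        workers.shutdown();
    }
}
```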

--
View this message in context: http://lucene.472066.n3.nabble.com/Using-Nutch-for-Web-Site-Mirroring-tp3986066p3986086.html