Posted to user@nutch.apache.org by Noah Silverman <no...@smartmediacorp.com> on 2009/12/20 23:07:44 UTC

Use nutch like wget

Hi,

We need to archive the contents of a few large websites.

Nutch is a great crawler that works very quickly.

Is it possible to use Nutch to "mirror" a website?  We want to download
and STORE all the files in their original format (including images: jpg,
gif, png, etc.).

I have managed to get Nutch to crawl and index the pages using the
simple nutch crawl command, but I don't see where the raw files are stored.
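For reference, the one-step crawl I ran looks roughly like this (the seed directory, depth, and topN below are just illustrative values, not anything special):

```shell
# One-step crawl (Nutch 1.x): urls/ holds the seed list; the
# output (crawldb, linkdb, segments) lands under crawl/.
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```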

Additionally, we don't need to index the content for this project.  Just
fetch and store.
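As I understand it, the fetch-and-store part corresponds to just the individual steps below, skipping the parse/index ones (command names are from Nutch 1.x; the paths are illustrative):

```shell
# Fetch without indexing: inject seeds, generate a fetch list,
# fetch it, then fold the results back into the crawl db.
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
segment=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$segment"
bin/nutch updatedb crawl/crawldb "$segment"
```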

Can anybody point me to the right way to do this with Nutch?

Thanks!

-N

Re: Use nutch like wget

Posted by MilleBii <mi...@gmail.com>.
Look in $NUTCH_HOME/segments/<segmentdate>/

The content is stored there under different subdirectories, depending on
which form you want it in.
./content holds the original fetched data
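If you want the raw pages out of a segment as plain files, the SegmentReader dump should do it; something like the following (the segment timestamp and output directory below are made up, use your own):

```shell
# Dump only the raw fetched content of one segment to text form,
# disabling the other record types (fetch, generate, parse data).
# "20091221120000" is a hypothetical segment name.
bin/nutch readseg -dump crawl/segments/20091221120000 segdump \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext
```

The result should end up under segdump/ as a text rendering of each fetched record, including the original bytes of images and other binary files.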

2009/12/21, Matthew A. Bockol <mb...@carleton.edu>:
> Hi Noah,
>
> Take a look at NutchWAX and the heritrix crawler, both from Archive.org.
>
> Matt
>
>
> ----- Original Message -----
> From: "Noah Silverman" <no...@smartmediacorp.com>
> To: nutch-user@lucene.apache.org
> Sent: Sunday, December 20, 2009 4:07:44 PM
> Subject: Use nutch like wget


-- 
-MilleBii-

Re: Use nutch like wget

Posted by "Matthew A. Bockol" <mb...@carleton.edu>.
Hi Noah,

Take a look at NutchWAX and the Heritrix crawler, both from Archive.org.

Matt

