Posted to user@nutch.apache.org by "onlinespending@gmail.com" <on...@gmail.com> on 2010/09/02 00:44:16 UTC

Selective Fetching and Notifying When Files Have Been Modified Since Last Fetch

  Hi,

I'd like to use Nutch to crawl a very limited set of pages.  While 
crawling, I'd like it to fetch only the pages and files that match 
certain criteria.  I'd also like to be alerted when any of these 
fetched files has been modified (modify date of the file or change in 
content) since Nutch last fetched it.

My criteria are as follows.

1.) save HTML if the page name or contents contain one or more given 
words (or if the anchor of the link pointing to the page contains them)
2.) save all images above a certain size (or resolution)
3.) save all PDF files (no filter)
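
A minimal sketch of how these three criteria could be expressed as a 
single filter, assuming the content has already been fetched and is 
available as raw bytes together with its content type.  This is plain 
Python, not a Nutch API; the function name should_save, the keyword 
list, and the byte threshold are all hypothetical, and image "size" is 
approximated here by byte length rather than resolution.

```python
from urllib.parse import urlparse


def should_save(url, content_type, body, anchor_text="",
                keywords=("nutch",), min_image_bytes=50_000):
    """Return True if a fetched document matches one of the three criteria.

    All parameter names and defaults are illustrative assumptions.
    """
    # Drop parameters such as "; charset=utf-8" from the content type.
    ctype = content_type.split(";")[0].strip().lower()

    # Criterion 3: every PDF is saved, no filter.
    if ctype == "application/pdf":
        return True

    # Criterion 2: images above a size threshold (byte length only).
    if ctype.startswith("image/"):
        return len(body) >= min_image_bytes

    # Criterion 1: HTML whose page name, contents, or inbound anchor
    # text contains at least one keyword (case-insensitive).
    if ctype in ("text/html", "application/xhtml+xml"):
        page_name = urlparse(url).path.rsplit("/", 1)[-1]
        text = body.decode("utf-8", errors="replace")
        haystack = " ".join((page_name, text, anchor_text)).lower()
        return any(k.lower() in haystack for k in keywords)

    return False
```

Checking resolution instead of byte length would need an image 
library to read the dimensions, which this sketch leaves out.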

I'd like these files (HTML, images, and PDFs) to be saved into a 
separate folder for each domain, along with the date each file was 
modified.  That date would be compared the next time a fetch is made, 
and an alert would let me know that a new version of one of the files 
has been fetched (ideally by email).
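
The comparison described above can be sketched as follows, assuming a 
small per-URL record (content hash plus modify date) is kept between 
fetches.  The names has_changed and build_alert and the dict-based 
state are hypothetical; persisting the state and obtaining the modify 
date (for example from a Last-Modified header) are left out.

```python
import hashlib
from email.message import EmailMessage


def has_changed(state, url, body, last_modified=None):
    """Record this fetch in `state` and report whether the file changed.

    `state` is a plain dict mapping URL -> record of the previous fetch.
    A first-time fetch is not reported as a modification.
    """
    current = {
        "sha256": hashlib.sha256(body).hexdigest(),
        "last_modified": last_modified,
    }
    previous = state.get(url)
    state[url] = current
    return previous is not None and previous != current


def build_alert(changed_urls, sender, recipient):
    """Build (but do not send) an email listing the changed URLs."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(changed_urls)} fetched file(s) changed"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(changed_urls))
    return msg
```

The message could then be handed to smtplib.SMTP(...).send_message(msg) 
against whatever mail server is available; that part depends on the 
local setup and is not shown.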

I don't really intend to use Lucene/Solr to search the contents of 
these files.  I merely want the files fetched and organized by domain 
URL.  The modify alert then lets me go into each of the directories to 
view the modified files.

As you can tell, I'm a Nutch newbie, and I could not find any related 
information in the tutorials.  I'd appreciate any help getting such a 
setup going.

Thanks,
Ben

Fwd: Selective Fetching and Notifying When Files Have Been Modified Since Last Fetch

Posted by Sonal Goyal <so...@gmail.com>.
Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal


---------- Forwarded message ----------
From: Sonal Goyal <so...@gmail.com>
Date: Thu, Sep 2, 2010 at 10:33 PM
Subject: Re: Selective Fetching and Notifying When Files Have Been Modified
Since Last Fetch
To: user@nutch.apache.org


Ben,

In one of our projects, we run a post-processor that saves each fetched
file to a location derived from its URL. That's one approach you can try.
For selective fetching, and for getting all PDFs, you can write your own
scoring logic.
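
As a rough illustration of the URL-to-location idea (not the actual
post-processor from that project), here is a minimal sketch in plain
Python that assumes the fetched bytes are already in hand. The names
path_for_url and save_fetched are hypothetical, the URL's host name
stands in for "domain", and query strings are ignored.

```python
from pathlib import Path
from urllib.parse import urlparse, unquote


def path_for_url(root, url):
    """Map a URL to root/<host>/<path segments>, one folder per host."""
    parts = urlparse(url)
    host = parts.hostname or "unknown-host"
    # Drop empty, "." and ".." segments so the result stays under root.
    segments = [s for s in unquote(parts.path).split("/")
                if s not in ("", ".", "..")]
    # Directory-style URLs get a default file name.
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")
    return Path(root) / host / Path(*segments)


def save_fetched(root, url, body):
    """Write the fetched bytes to the location derived from the URL."""
    target = path_for_url(root, url)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(body)
    return target
```

With this mapping, http://example.com/docs/a.pdf would be written to
<root>/example.com/docs/a.pdf.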

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal


