Posted to user@nutch.apache.org by "onlinespending@gmail.com" <on...@gmail.com> on 2010/09/02 00:44:16 UTC
Selective Fetching and Notifying When Files Have Been Modified Since
Last Fetch
Hi,
I'd like to use Nutch to crawl a very limited set of pages. While
crawling, it should fetch only the pages and files that match certain
criteria. I'd also like to be alerted when any of these fetched files
has been modified (a new modification date or a change in content)
since Nutch last fetched it.
My criteria are as follows.
1.) save HTML if the page name or contents contain a word or words (or
if the anchor from the page that linked to it had the word)
2.) save all images above a certain size (or resolution)
3.) save all PDF files (no filter)
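The three criteria above could be expressed as a single predicate applied to each fetched document. This is only a minimal sketch in Python, independent of Nutch's own plugin API; the function name `should_save`, its parameters, and the default pixel threshold are all hypothetical.

```python
from urllib.parse import urlparse


def should_save(url, content_type, text="", anchor_text="",
                width=0, height=0, keywords=(), min_pixels=640 * 480):
    """Return True if a fetched document matches one of the three criteria."""
    content_type = content_type.lower()
    # 3.) PDFs are always kept.
    if content_type.startswith("application/pdf"):
        return True
    # 2.) Images are kept only at or above a minimum resolution
    #     (the threshold is an assumed example value).
    if content_type.startswith("image/"):
        return width * height >= min_pixels
    # 1.) HTML is kept if a keyword appears in the page name, the page
    #     contents, or the anchor text of the link that led to the page.
    if content_type.startswith("text/html"):
        page_name = urlparse(url).path.rsplit("/", 1)[-1]
        haystack = " ".join((page_name, text, anchor_text)).lower()
        return any(k.lower() in haystack for k in keywords)
    return False
```

The caller is assumed to have already extracted the page text, the inbound anchor text, and the image dimensions; how those are obtained depends on the crawler.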
I'd like these files (HTML, images, and PDFs) to be saved into a
separate folder for each domain, along with the date each file was
last modified. That date would be compared on the next fetch, and an
alert (ideally by email) would let me know that a new version of one
of the files has been fetched.
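The comparison described above could be sketched as follows. This is an illustration under stated assumptions, not a Nutch feature: the JSON state file, the function name `detect_change`, and the use of a SHA-256 content hash alongside the modification date are all hypothetical choices.

```python
import hashlib
import json
from pathlib import Path


def detect_change(state_file, url, content, last_modified=None):
    """Record a fetch and return True if the URL changed since the previous one.

    `content` is the fetched body as bytes; `last_modified` is whatever
    modification date the server reported (or None if it reported none).
    """
    path = Path(state_file)
    state = json.loads(path.read_text()) if path.exists() else {}
    digest = hashlib.sha256(content).hexdigest()
    previous = state.get(url)
    # A first-time fetch is not reported as a change; afterwards either a
    # different content hash or a different modification date counts.
    changed = previous is not None and (
        previous["sha256"] != digest
        or previous["last_modified"] != last_modified
    )
    state[url] = {"sha256": digest, "last_modified": last_modified}
    path.write_text(json.dumps(state, indent=2))
    return changed
```

A caller could collect every URL for which this returns True and send one summary email per crawl, for example with Python's standard `smtplib` module.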
I don't really intend to use Lucene/Solr to search the contents of
these files. I just want the files fetched and organized by domain.
The modification alert would then let me go into each directory and
view the modified files.
As you can tell, I'm a Nutch newbie, and I could not find any related
information in the tutorials. I'd appreciate any help getting such a
setup going.
Thanks,
Ben
Fwd: Selective Fetching and Notifying When Files Have Been Modified
Since Last Fetch
Posted by Sonal Goyal <so...@gmail.com>.
Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal
---------- Forwarded message ----------
From: Sonal Goyal <so...@gmail.com>
Date: Thu, Sep 2, 2010 at 10:33 PM
Subject: Re: Selective Fetching and Notifying When Files Have Been Modified
Since Last Fetch
To: user@nutch.apache.org
Ben,
In one of our projects, we run a post-processor that saves each fetched
file to a location derived from its URL. That's one approach you could
try. For selective fetching and getting all PDFs, you can create your own
scoring logic.
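A post-processing step of this kind might look like the sketch below. It is not the code from the project mentioned above and does not use any Nutch API; the function names `local_path_for` and `save_fetched` and the `crawl-files` root directory are hypothetical, and the host name stands in for "domain".

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def local_path_for(url, root="crawl-files"):
    """Map a URL to root/<host>/<path>, one folder per host."""
    parsed = urlparse(url)
    host = (parsed.hostname or "unknown-host").lower()
    # Drop empty, "." and ".." segments so a URL cannot escape the root.
    parts = [p for p in unquote(parsed.path).split("/")
             if p not in ("", ".", "..")]
    # Directory-style URLs get a default file name.
    if not parts or parsed.path.endswith("/"):
        parts.append("index.html")
    # Note: the query string is ignored, so URLs that differ only in
    # their query would map to the same file.
    return Path(root, host, *parts)


def save_fetched(url, content, root="crawl-files"):
    """Write fetched bytes to the path derived from the URL and return it."""
    target = local_path_for(url, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target
```

For example, `save_fetched("http://example.com/docs/a.pdf", data)` would write to `crawl-files/example.com/docs/a.pdf`.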
Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal