Posted to user@nutch.apache.org by Aldo Armiento <al...@armiento.com> on 2005/03/05 23:18:14 UTC

Crawling specific pages

Hi All...

I need to crawl about 20 sites. Each site is structured as:

an ENTRY-PAGE (http://www.example.com/lists.html) with a list of links to
sub-pages (http://www.example.com/sub-page1.html ... sub-pageN.html).

I need to:

- _always_ fetch the ENTRY-PAGE (the list of links to sub-pages);
- if a sub-page (URL) is already in the DB, *don't fetch it* (to keep bandwidth usage low);
- if a sub-page is not in the DB, *fetch it*.

I need to run this about 2-3 times a day for all 20 sites.

Is it possible to do this with Nutch? If yes, how do I configure it for this purpose?

Thank you,
Duc.

P.S. I noticed that I need to restart Apache Tomcat to search the new segments.
Is it possible to do this without restarting it?

Re: Crawling specific pages

Posted by sub paul <su...@gmail.com>.
Is touching web.xml the only option? That still restarts the whole web
application. Can we somehow force the NutchBean to re-initialize? Would
that then pick up the newly updated index?

Regards,
Paul



On Sun, 6 Mar 2005 16:23:43 +0100, Stefan Groschupf <sg...@media-style.com> wrote:
> > I suppose I have to reduce "db.default.fetch.interval"? Or not?
> As far as I know this is an integer value, so you cannot set it to half a
> day.
> 
> > What I need is not to always re-fetch all pages, but only the
> > "LIST-OF-INTERNAL-LINKS" page and, if there are new links on it, to
> > fetch those new pages. I'm sure that the pages I've already fetched
> > have not changed. The only page that can change is the
> > "LIST-OF-INTERNAL-LINKS".
> > Is that possible?
> Well, maybe you have to do some tricks.
> Check whether a different regular expression limitation can help, or
> think about implementing your own URL filter plugin.
> 
> > P.S. Where can I find documentation about the Nutch crawler?
> >
> The code, the wiki, the mailing list, and a search for Nutch papers on
> scholar.google.com.
> :-)
> 
> Stefan
> 
>

Re: Crawling specific pages

Posted by Stefan Groschupf <sg...@media-style.com>.
> I suppose I have to reduce "db.default.fetch.interval"? Or not?
As far as I know this is an integer value, so you cannot set it to half a day.
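
For what it's worth, a minimal nutch-site.xml override along these lines could
shorten the interval to one day. This is only a sketch, assuming the property
is measured in whole days as described above; check nutch-default.xml for the
exact semantics in your version:

  <property>
    <name>db.default.fetch.interval</name>
    <!-- re-fetch known pages after 1 day instead of the default -->
    <value>1</value>
  </property>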

> What I need is not to always re-fetch all pages, but only the
> "LIST-OF-INTERNAL-LINKS" page and, if there are new links on it, to fetch
> those new pages. I'm sure that the pages I've already fetched have not
> changed. The only page that can change is the "LIST-OF-INTERNAL-LINKS".
> Is that possible?
Well, maybe you have to do some tricks.
Check whether a different regular expression limitation can help, or think
about implementing your own URL filter plugin.
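
For the regular-expression route, the stock regex URL filter could probably be
pointed at just the pages you care about. A rough regex-urlfilter.txt sketch,
using the example URLs from your first mail (the patterns are illustrative,
and the first matching rule wins):

  # always allow the entry page (the list of links)
  +^http://www\.example\.com/lists\.html$
  # allow its sub-pages
  +^http://www\.example\.com/sub-page.*\.html$
  # skip everything else
  -.

If the patterns get too hairy for one regex file, a tiny URL filter plugin is
the other option. A rough Java sketch, assuming the Nutch 0.x URLFilter
extension point with a single filter(String) method; the package, class name
and patterns here are made up, and a real plugin also needs the usual
plugin.xml descriptor and an entry in plugin.includes:

  package org.example.nutch;

  import org.apache.nutch.net.URLFilter;

  /** Keeps the crawl on the entry page and its sub-pages only. */
  public class ListAndSubPageFilter implements URLFilter {

    public String filter(String url) {
      if (url.endsWith("/lists.html")) {
        return url;   // always keep the entry page
      }
      if (url.matches("http://www\\.example\\.com/sub-page.*\\.html")) {
        return url;   // keep sub-pages; the WebDB decides whether they
                      // are actually due for fetching
      }
      return null;    // drop everything else
    }
  }

Note that a URL filter only limits which URLs enter the crawl; whether an
already-known sub-page gets re-fetched is still governed by its fetch
interval in the WebDB.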

> P.S. Where can I find documentation about the Nutch crawler?
>
The code, the wiki, the mailing list, and a search for Nutch papers on
scholar.google.com.
:-)

Stefan


Re: Crawling specific pages

Posted by Aldo Armiento <al...@armiento.com>.
Dear Stefan,
thank you for your reply.

I suppose I have to reduce "db.default.fetch.interval"? Or not?

What I need is not to always re-fetch all pages, but only the
"LIST-OF-INTERNAL-LINKS" page and, if there are new links on it, to fetch
those new pages. I'm sure that the pages I've already fetched have not
changed. The only page that can change is the "LIST-OF-INTERNAL-LINKS".
Is that possible?

Thanks,
Aldo.

P.S. Where can I find documentation about the Nutch crawler?

Quoting Stefan Groschupf <sg...@media-style.com>:

> > Is it possible to do this with Nutch? If yes, how do I configure it
> > for this purpose?
> 
> Just write a shell script that does all the steps described in the
> whole-web crawl tutorial, and run the script in a never-ending loop with
> a sleep, or use a cron job.
> >
> > P.S. I noticed that I need to restart Apache Tomcat to search the new
> > segments.
> > Is it possible to do this without restarting it?
> >
> Touch the web.xml.
> 
> HTH
> Stefan
> 
> 


-- 
Aldo Armiento
aldo@armiento.com

Re: Crawling specific pages

Posted by Stefan Groschupf <sg...@media-style.com>.
> Is it possible to do this with Nutch? If yes, how do I configure it for this purpose?

Just write a shell script that does all the steps described in the whole-web
crawl tutorial, and run the script in a never-ending loop with a sleep, or
use a cron job.
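
For the cron route, a single crontab entry could drive such a script the 2-3
times a day you asked for. Just a sketch; the script path is a placeholder for
whatever wraps the tutorial steps:

  # run the whole-web crawl steps at 00:00, 08:00 and 16:00
  0 0,8,16 * * *  /usr/local/nutch/bin/recrawl.sh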
>
> P.S. I noticed that I need to restart Apache Tomcat to search the new
> segments.
> Is it possible to do this without restarting it?
>
Touch the web.xml.

HTH
Stefan