Posted to user@nutch.apache.org by Ali rahmani <al...@yahoo.com> on 2014/05/21 12:22:22 UTC

Re-crawl every 24 hours

Dear Sir, 
I am customizing Nutch 2.2 to crawl my seed list, which contains about 30 URLs. I need to re-crawl these URLs every 24 hours and JUST fetch newly added links. I added the following configuration to my nutch-site.xml file and use the following command:

<property>
  <name>db.fetch.interval.default</name>
  <value>1800</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>

<property>
  <name>db.update.purge.404</name>
  <value>true</value>
  <description>If true, updatedb will add purge records with status DB_GONE
  from the CrawlDB.
  </description>
</property>


./crawl urls/ testdb http://localhost:8983/solr 2


but whenever I run the above command, Nutch crawls deeper and deeper.
Would you please tell me where the problem is?
Regards,

Re: Re-crawl every 24 hours

Posted by Ali Nazemian <al...@gmail.com>.
Hi Ali,
It is the same problem that I faced recently. It is my concern too. I would
appreciate it if somebody could answer this question.
Best regards.


On Wed, May 21, 2014 at 2:52 PM, Ali rahmani <al...@yahoo.com> wrote:

> Dear Sir,
> I am customizing Nutch 2.2 to crawl my seed list, which contains about 30
> URLs. I need to re-crawl these URLs every 24 hours and JUST fetch newly
> added links. I added the following configuration to my nutch-site.xml file
> and use the following command:
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>1800</value>
>   <description>The default number of seconds between re-fetches of a page
> (30 days).
>   </description>
> </property>
>
> <property>
>   <name>db.update.purge.404</name>
>   <value>true</value>
>   <description>If true, updatedb will add purge records with status DB_GONE
>   from the CrawlDB.
>   </description>
> </property>
>
>
> ./crawl urls/ testdb http://localhost:8983/solr 2
>
>
> but whenever I run the above command, Nutch crawls deeper and deeper.
> Would you please tell me where the problem is?
> Regards,




-- 
A.Nazemian

RE: Re-crawl every 24 hours

Posted by Markus Jelsma <ma...@openindex.io>.
That will work, but use nutch.fetchInterval.fixed in case you use an adaptive fetch scheduler.
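
For example, a seed line with a fixed per-URL interval could look like this (a sketch assuming the tab-separated seed metadata described on the Nutch inject wiki page; the URL and seed file name are placeholders):

# append a seed whose re-fetch interval is pinned to 24 hours (86400 s);
# nutch.fetchInterval.fixed is not adjusted by the adaptive scheduler
printf 'http://www.example.com/\tnutch.fetchInterval.fixed=86400\n' >> urls/seed.txt
./nutch inject urls/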

-----Original message-----
> From: Julien Nioche <li...@gmail.com>
> Sent: Friday 23rd May 2014 12:09
> To: user@nutch.apache.org
> Subject: Re: Re-crawl every 24 hours
> 
> Hi
> 
> This will work with 1.8 indeed. What procedure do you mean? Just add
> nutch.fetchInterval to the seeds, that's all.
> 
> J.
> 
> 
> On 23 May 2014 10:13, Ali Nazemian <al...@gmail.com> wrote:
> 
> > Dear Julien,
> > Hi,
> > Do you know any step by step guide for this procedure? Is this the same for
> > nutch 1.8?
> > Best regards.
> >
> >
> > On Wed, May 21, 2014 at 6:43 PM, Julien Nioche <
> > lists.digitalpebble@gmail.com> wrote:
> >
> > > <property>
> > >   <name>db.fetch.interval.default</name>
> > >   <value>1800</value>
> > >   <description>The default number of seconds between re-fetches of a page
> > > (30 days).
> > >   </description>
> > > </property>
> > >
> > > means that a page which has already been fetched will be refetched again
> > > after 30mins. This is what you want for the seeds but is also applied to
> > > the subpages you've already discovered in previous rounds.
> > >
> > > What you could do would be to set a custom fetch interval for the seeds
> > > only (see http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
> > > nutch.fetchInterval) and have a larger value for
> > > db.fetch.interval.default.
> > > This way the seeds would be revisited frequently but not the subpages. Note
> > > that this would work only if the links to the pages you want to discover
> > > are directly in the seed files. If they are at a deeper level then they'd
> > > be discovered only when the page that mentions them is re-fetched (==
> > > nutch.fetchInterval)
> > >
> > > HTH
> > >
> > > Julien
> > >
> > >
> > > On 21 May 2014 11:22, Ali rahmani <al...@yahoo.com> wrote:
> > >
> > > > Dear Sir,
> > > > I am customizing Nutch 2.2 to crawl my seed list, which contains
> > > > about 30 URLs. I need to re-crawl these URLs every 24 hours and JUST
> > > > fetch newly added links. I added the following configuration to my
> > > > nutch-site.xml file and use the following command:
> > > >
> > > > <property>
> > > >   <name>db.fetch.interval.default</name>
> > > >   <value>1800</value>
> > > >   <description>The default number of seconds between re-fetches of a
> > > > page (30 days).
> > > >   </description>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>db.update.purge.404</name>
> > > >   <value>true</value>
> > > >   <description>If true, updatedb will add purge records with status
> > > > DB_GONE
> > > >   from the CrawlDB.
> > > >   </description>
> > > > </property>
> > > >
> > > >
> > > > ./crawl urls/ testdb http://localhost:8983/solr 2
> > > >
> > > >
> > > > but whenever I run the above command, Nutch crawls deeper and deeper.
> > > > Would you please tell me where the problem is?
> > > > Regards,
> > >
> > >
> > >
> > >
> > > --
> > >
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
> 
> 
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 

Re: Re-crawl every 24 hours

Posted by Ali rahmani <al...@yahoo.com>.
Hi Julien, 
Would you please guide me on what a re-crawling script should look like? I follow the steps below (even after adding the fetch interval parameter), and the crawler still goes deeper and deeper.
1) ./nutch inject urls/
2) Loop {
./nutch generate -topN 2000
./nutch fetch [CrawlID]
./nutch parse [CrawlID]
./nutch updatedb
}
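
(For reference, a runnable sketch of the cycle above, using Nutch 2.x commands; the seed directory, topN value and round count are placeholders:)

#!/bin/bash
# one crawl cycle per run; schedule this script once every 24 hours
./nutch inject urls/            # refresh the seeds
for round in 1 2; do
  ./nutch generate -topN 2000   # select URLs that are due for fetching
  ./nutch fetch -all            # fetch all pending generated batches
  ./nutch parse -all            # parse everything that was fetched
  ./nutch updatedb              # merge newly discovered links into the db
done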

It is worth mentioning that I repeat these steps every 24 hours.
Regards,
A.R


On Friday, May 23, 2014 2:39:13 PM, Julien Nioche <li...@gmail.com> wrote:

Hi

This will work with 1.8 indeed. What procedure do you mean? Just add
nutch.fetchInterval to the seeds, that's all.

J.


On 23 May 2014 10:13, Ali Nazemian <al...@gmail.com> wrote:

> Dear Julien,
> Hi,
> Do you know any step by step guide for this procedure? Is this the same for
> nutch 1.8?
> Best regards.
>
>
> On Wed, May 21, 2014 at 6:43 PM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
> > <property>
> >   <name>db.fetch.interval.default</name>
> >   <value>1800</value>
> >   <description>The default number of seconds between re-fetches of a page
> > (30 days).
> >   </description>
> > </property>
> >
> > means that a page which has already been fetched will be refetched again
> > after 30mins. This is what you want for the seeds but is also applied to
> > the subpages you've already discovered in previous rounds.
> >
> > What you could do would be to set a custom fetch interval for the seeds
> > only (see http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
> > nutch.fetchInterval) and have a larger value for
> > db.fetch.interval.default.
> > This way the seeds would be revisited frequently but not the subpages. Note
> > that this would work only if the links to the pages you want to discover
> > are directly in the seed files. If they are at a deeper level then they'd
> > be discovered only when the page that mentions them is re-fetched (==
> > nutch.fetchInterval)
> >
> > HTH
> >
> > Julien
> >
> >
> > On 21 May 2014 11:22, Ali rahmani <al...@yahoo.com> wrote:
> >
> > > Dear Sir,
> > > I am customizing Nutch 2.2 to crawl my seed list, which contains about
> > > 30 URLs. I need to re-crawl these URLs every 24 hours and JUST fetch
> > > newly added links. I added the following configuration to my
> > > nutch-site.xml file and use the following command:
> > >
> > > <property>
> > >   <name>db.fetch.interval.default</name>
> > >   <value>1800</value>
> > >   <description>The default number of seconds between re-fetches of a
> > > page (30 days).
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>db.update.purge.404</name>
> > >   <value>true</value>
> > >   <description>If true, updatedb will add purge records with status
> > > DB_GONE
> > >   from the CrawlDB.
> > >   </description>
> > > </property>
> > >
> > >
> > > ./crawl urls/ testdb http://localhost:8983/solr 2
> > >
> > >
> > > but whenever I run the above command, Nutch crawls deeper and deeper.
> > > Would you please tell me where the problem is?
> > > Regards,
> >
> >
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
>
>
> --
> A.Nazemian

>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Re-crawl every 24 hours

Posted by Julien Nioche <li...@gmail.com>.
Hi

This will work with 1.8 indeed. What procedure do you mean? Just add
nutch.fetchInterval to the seeds, that's all.

J.


On 23 May 2014 10:13, Ali Nazemian <al...@gmail.com> wrote:

> Dear Julien,
> Hi,
> Do you know any step by step guide for this procedure? Is this the same for
> nutch 1.8?
> Best regards.
>
>
> On Wed, May 21, 2014 at 6:43 PM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
> > <property>
> >   <name>db.fetch.interval.default</name>
> >   <value>1800</value>
> >   <description>The default number of seconds between re-fetches of a page
> > (30 days).
> >   </description>
> > </property>
> >
> > means that a page which has already been fetched will be refetched again
> > after 30mins. This is what you want for the seeds but is also applied to
> > the subpages you've already discovered in previous rounds.
> >
> > What you could do would be to set a custom fetch interval for the seeds
> > only (see http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
> > nutch.fetchInterval) and have a larger value for
> > db.fetch.interval.default.
> > This way the seeds would be revisited frequently but not the subpages. Note
> > that this would work only if the links to the pages you want to discover
> > are directly in the seed files. If they are at a deeper level then they'd
> > be discovered only when the page that mentions them is re-fetched (==
> > nutch.fetchInterval)
> >
> > HTH
> >
> > Julien
> >
> >
> > On 21 May 2014 11:22, Ali rahmani <al...@yahoo.com> wrote:
> >
> > > Dear Sir,
> > > I am customizing Nutch 2.2 to crawl my seed list, which contains about
> > > 30 URLs. I need to re-crawl these URLs every 24 hours and JUST fetch
> > > newly added links. I added the following configuration to my
> > > nutch-site.xml file and use the following command:
> > >
> > > <property>
> > >   <name>db.fetch.interval.default</name>
> > >   <value>1800</value>
> > >   <description>The default number of seconds between re-fetches of a
> > > page (30 days).
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>db.update.purge.404</name>
> > >   <value>true</value>
> > >   <description>If true, updatedb will add purge records with status
> > > DB_GONE
> > >   from the CrawlDB.
> > >   </description>
> > > </property>
> > >
> > >
> > > ./crawl urls/ testdb http://localhost:8983/solr 2
> > >
> > >
> > > but whenever I run the above command, Nutch crawls deeper and deeper.
> > > Would you please tell me where the problem is?
> > > Regards,
> >
> >
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
>
>
> --
> A.Nazemian
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Re-crawl every 24 hours

Posted by Ali Nazemian <al...@gmail.com>.
Dear Julien,
Hi,
Do you know of any step-by-step guide for this procedure? Is it the same
for Nutch 1.8?
Best regards.


On Wed, May 21, 2014 at 6:43 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> <property>
>   <name>db.fetch.interval.default</name>
>   <value>1800</value>
>   <description>The default number of seconds between re-fetches of a page
> (30 days).
>   </description>
> </property>
>
> means that a page which has already been fetched will be refetched again
> after 30mins. This is what you want for the seeds but is also applied to
> the subpages you've already discovered in previous rounds.
>
> What you could do would be to set a custom fetch interval for the seeds
> only (see http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
> nutch.fetchInterval) and have a larger value for db.fetch.interval.default.
> This way the seeds would be revisited frequently but not the subpages. Note
> that this would work only if the links to the pages you want to discover
> are directly in the seed files. If they are at a deeper level then they'd
> be discovered only when the page that mentions them is re-fetched (==
> nutch.fetchInterval)
>
> HTH
>
> Julien
>
>
> On 21 May 2014 11:22, Ali rahmani <al...@yahoo.com> wrote:
>
> > Dear Sir,
> > I am customizing Nutch 2.2 to crawl my seed list, which contains about 30
> > URLs. I need to re-crawl these URLs every 24 hours and JUST fetch newly
> > added links. I added the following configuration to my nutch-site.xml file
> > and use the following command:
> >
> > <property>
> >   <name>db.fetch.interval.default</name>
> >   <value>1800</value>
> >   <description>The default number of seconds between re-fetches of a page
> > (30 days).
> >   </description>
> > </property>
> >
> > <property>
> >   <name>db.update.purge.404</name>
> >   <value>true</value>
> >   <description>If true, updatedb will add purge records with status
> > DB_GONE
> >   from the CrawlDB.
> >   </description>
> > </property>
> >
> >
> > ./crawl urls/ testdb http://localhost:8983/solr 2
> >
> >
> > but whenever I run the above command, Nutch crawls deeper and deeper.
> > Would you please tell me where the problem is?
> > Regards,
>
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
A.Nazemian

Re: Re-crawl every 24 hours

Posted by al...@aim.com.
Hi,

Another way of doing this is to increase

db.fetch.interval.default

to x years and re-inject the original seeds each time. This way you will fetch only new pages during those x years, since an injected URL's fetch time is set to the current time (I believe; you may want to double-check this first), while the already-fetched pages will only be picked up again after x years.
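
A sketch of that approach as a daily job (the install path and script name are hypothetical; it assumes db.fetch.interval.default is already set very high in nutch-site.xml, e.g. 63072000 seconds = 2 years):

#!/bin/bash
# recrawl.sh - re-inject the seeds so their fetch time is reset to now,
# then run one crawl cycle; already-fetched pages are not yet due again
cd /opt/nutch/runtime/local    # hypothetical install location
bin/nutch inject urls/
bin/crawl urls/ testdb http://localhost:8983/solr 2
# example crontab entry to run it daily at 03:00:
# 0 3 * * * /opt/nutch/runtime/local/recrawl.sh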

HTH.
Alex 


-----Original Message-----
From: Julien Nioche <li...@gmail.com>
To: user <us...@nutch.apache.org>; Ali rahmani <al...@yahoo.com>
Sent: Wed, May 21, 2014 7:14 am
Subject: Re: Re-crawl every 24 hours


<property>
  <name>db.fetch.interval.default</name>
  <value>1800</value>
  <description>The default number of seconds between re-fetches of a page
(30 days).
  </description>
</property>

means that a page which has already been fetched will be re-fetched
after 30 minutes. This is what you want for the seeds but is also applied to
the subpages you've already discovered in previous rounds.

What you could do would be to set a custom fetch interval for the seeds
only (see http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
nutch.fetchInterval) and have a larger value for db.fetch.interval.default.
This way the seeds would be revisited frequently but not the subpages. Note
that this would work only if the links to the pages you want to discover
are directly in the seed files. If they are at a deeper level then they'd
be discovered only when the page that mentions them is re-fetched (==
nutch.fetchInterval)

HTH

Julien


On 21 May 2014 11:22, Ali rahmani <al...@yahoo.com> wrote:

> Dear Sir,
> I am customizing Nutch 2.2 to crawl my seed list, which contains about 30
> URLs. I need to re-crawl these URLs every 24 hours and JUST fetch newly
> added links. I added the following configuration to my nutch-site.xml file
> and use the following command:
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>1800</value>
>   <description>The default number of seconds between re-fetches of a page
> (30 days).
>   </description>
> </property>
>
> <property>
>   <name>db.update.purge.404</name>
>   <value>true</value>
>   <description>If true, updatedb will add purge records with status DB_GONE
>   from the CrawlDB.
>   </description>
> </property>
>
>
> ./crawl urls/ testdb http://localhost:8983/solr 2
>
>
> but whenever I run the above command, Nutch crawls deeper and deeper.
> Would you please tell me where the problem is?
> Regards,




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Re-crawl every 24 hours

Posted by Julien Nioche <li...@gmail.com>.
<property>
  <name>db.fetch.interval.default</name>
  <value>1800</value>
  <description>The default number of seconds between re-fetches of a page
(30 days).
  </description>
</property>

means that a page which has already been fetched will be re-fetched
after 30 minutes. This is what you want for the seeds but is also applied to
the subpages you've already discovered in previous rounds.

What you could do would be to set a custom fetch interval for the seeds
only (see http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
nutch.fetchInterval) and have a larger value for db.fetch.interval.default.
This way the seeds would be revisited frequently but not the subpages. Note
that this would work only if the links to the pages you want to discover
are directly in the seed files. If they are at a deeper level then they'd
be discovered only when the page that mentions them is re-fetched (==
nutch.fetchInterval)
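
For illustration, a minimal sketch of that setup (the URL is a placeholder; metadata fields on a seed line are tab-separated, as described on the wiki page above):

mkdir -p urls
# seeds get a short per-URL interval: 86400 s = 24 hours
printf 'http://www.example.com/\tnutch.fetchInterval=86400\n' > urls/seed.txt
./nutch inject urls/
# leave db.fetch.interval.default large in nutch-site.xml (e.g. the default
# 2592000 s = 30 days) so discovered subpages are not re-fetched every day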

HTH

Julien


On 21 May 2014 11:22, Ali rahmani <al...@yahoo.com> wrote:

> Dear Sir,
> I am customizing Nutch 2.2 to crawl my seed list, which contains about 30
> URLs. I need to re-crawl these URLs every 24 hours and JUST fetch newly
> added links. I added the following configuration to my nutch-site.xml file
> and use the following command:
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>1800</value>
>   <description>The default number of seconds between re-fetches of a page
> (30 days).
>   </description>
> </property>
>
> <property>
>   <name>db.update.purge.404</name>
>   <value>true</value>
>   <description>If true, updatedb will add purge records with status DB_GONE
>   from the CrawlDB.
>   </description>
> </property>
>
>
> ./crawl urls/ testdb http://localhost:8983/solr 2
>
>
> but whenever I run the above command, Nutch crawls deeper and deeper.
> Would you please tell me where the problem is?
> Regards,




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble