
Posted to user@nutch.apache.org by Erwin Gunadi <fe...@gmail.com> on 2014/02/10 13:04:25 UTC

Question about fetch interval value

Hi,

I have a question about the behavior of AdaptiveFetchSchedule in combination
with "db.fetch.interval.default".

I know that one should configure:

- db.fetch.schedule.adaptive.min_interval
- db.fetch.schedule.adaptive.max_interval

in order to use AdaptiveFetchSchedule.

But I've been seeing strange behavior during crawling: it always
tries to re-fetch using the value of "db.fetch.interval.default".

Thank you for your help.

Best Regards
Erwin

Re: good configuration for crawl image only with nutch

Posted by feng lu <am...@gmail.com>.

Yes, you cannot crawl only images unless you already know all the image URLs.
Image URLs are extracted from HTML pages, so if you want to crawl an image,
you need to crawl the HTML page that contains it.




On Tue, Feb 11, 2014 at 6:02 AM, Eyeris RodrIguez Rueda <er...@uci.cu> wrote:

> Hi.
>
> I need to configure Nutch to crawl images only, but I have not had good
> results configuring the URL filters, especially the suffix and regex filters.
>
> Is it possible to see a good configuration for crawling only images (png,
> gif, jpg, and others) with Nutch?
>
> The problem with the filters is that Nutch must parse HTML and all other
> documents to discover new links, but should not index them in Solr. If I
> exclude HTML with these filters, Nutch says there are no URLs to fetch.
> Any help will be appreciated.
>
> I'm using Nutch 1.5.1 and Solr 3.6.
>
> ________________________________________________________________________________________________
> III International Winter School at UCI, 17 to 28 February 2014.
> See www.uci.cu
>



-- 
Don't Grow Old, Grow Up... :-)

good configuration for crawl image only with nutch

Posted by Eyeris RodrIguez Rueda <er...@uci.cu>.

Hi.

I need to configure Nutch to crawl images only, but I have not had good results configuring the URL filters, especially the suffix and regex filters.

Is it possible to see a good configuration for crawling only images (png, gif, jpg, and others) with Nutch?

The problem with the filters is that Nutch must parse HTML and all other documents to discover new links, but should not index them in Solr. If I exclude HTML with these filters, Nutch says there are no URLs to fetch.
Any help will be appreciated.

I'm using Nutch 1.5.1 and Solr 3.6.
________________________________________________________________________________________________
III International Winter School at UCI, 17 to 28 February 2014. See www.uci.cu

RE: Question about fetch interval value

Posted by Erwin Gunadi <fe...@gmail.com>.

Hi Talat,

Thank you for the hint. I'll look at it and try to upgrade my nutch copy.

Best Regards
Erwin


-----Original Message-----
From: Talat Uyarer [mailto:talat@uyarer.com] 
Sent: Tuesday, February 11, 2014 6:24 AM
To: user@nutch.apache.org
Subject: RE: Question about fetch interval value

Hi Erwin,
This is a bug in 2.2.1. I have fixed it. You can look at
https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-1651

Talat

RE: Question about fetch interval value

Posted by Talat Uyarer <ta...@uyarer.com>.

Hi Erwin,
This is a bug in 2.2.1. I have fixed it. You can look at
https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-1651

Talat
On 10 Feb 2014 at 15:44, "Erwin Gunadi" <fe...@gmail.com> wrote:

> Hi Markus,
>
> Thank you for the reply.
> Yes, I've set the required class to AdaptiveFetchSchedule in the nutch.xml.
>
> I'm using Nutch 2.1.
>
> Best Regards
> Erwin
>
>

RE: Question about fetch interval value

Posted by Erwin Gunadi <fe...@gmail.com>.

Hi Markus,

Thank you for the reply.
Yes, I've set the required class to AdaptiveFetchSchedule in the nutch.xml.

I'm using Nutch 2.1.

Best Regards
Erwin
 

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Monday, February 10, 2014 1:59 PM
To: user@nutch.apache.org
Subject: RE: Question about fetch interval value

Did you set

  <property>
   <name>db.fetch.schedule.class</name>
   <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>

as well? The other settings are not mandatory; they have defaults.
 
 

Re: Question about fetch interval value

Posted by RAHUL KATARE <ra...@gmail.com>.

Hi,

I am using Nutch-1.7 for crawling and getting the crawled data in
crawl/segments in HDFS. I want to get the structured data using
Apache-Tika. Can someone suggest me some reference on how to parse the
crawled data by Nutch using Apache-Tika?

Regards,
Rahul


On Mon, Feb 10, 2014 at 6:29 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Did you set
>
>   <property>
>    <name>db.fetch.schedule.class</name>
>    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
>   </property>
>
> as well? The other settings are not mandatory; they have defaults.
>
>

RE: Question about fetch interval value

Posted by Markus Jelsma <ma...@openindex.io>.

Did you set

  <property>
   <name>db.fetch.schedule.class</name>
   <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>

as well? The other settings are not mandatory; they have defaults.
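
For reference, a minimal sketch of how the settings mentioned in this thread
might sit together in nutch-site.xml. The property names are the ones quoted
in the thread; the values are illustrative assumptions only (intervals are in
seconds) and should be checked against the nutch-default.xml shipped with
your Nutch version.

  <configuration>
    <!-- Sketch only: all values below are assumed for illustration. -->
    <property>
     <name>db.fetch.schedule.class</name>
     <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>
    <!-- Starting interval for a page; assumed here to be one day. -->
    <property>
     <name>db.fetch.interval.default</name>
     <value>86400</value>
    </property>
    <!-- Lower bound for the adapted interval; assumed one hour. -->
    <property>
     <name>db.fetch.schedule.adaptive.min_interval</name>
     <value>3600</value>
    </property>
    <!-- Upper bound for the adapted interval; assumed 30 days. -->
    <property>
     <name>db.fetch.schedule.adaptive.max_interval</name>
     <value>2592000</value>
    </property>
  </configuration>

With a setup along these lines, one would expect the interval to start at the
default and then move between the two bounds as pages are found changed or
unchanged. If it stays at the default, the schedule class may not be picked
up, or the version in use may have a bug.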
 
 
-----Original message-----
> From: Erwin Gunadi <fe...@gmail.com>
> Sent: Monday 10th February 2014 13:05
> To: user@nutch.apache.org
> Subject: Question about fetch interval value
>
> Hi,
>
> I have a question about the behavior of AdaptiveFetchSchedule in combination
> with "db.fetch.interval.default".
>
> I know that one should configure:
>
> - db.fetch.schedule.adaptive.min_interval
> - db.fetch.schedule.adaptive.max_interval
>
> in order to use AdaptiveFetchSchedule.
>
> But I've been seeing strange behavior during crawling: it always
> tries to re-fetch using the value of "db.fetch.interval.default".
>
> Thank you for your help.
>
> Best Regards
> Erwin
>