Posted to user@nutch.apache.org by "Eggebrecht, Thomas (GfK Marktforschung)" <th...@gfk.com> on 2011/08/29 17:33:48 UTC

Parameter tuning or how to accelerate fetching

Dear List,

My process fetches only 10 domains, but very big ones with millions of pages on each site. I now wonder why, after 2 weeks and 17 crawl-fetch cycles, I have only a handful of about 30,000 pages, and the count seems to be stagnating.

How would you accelerate fetching?

My current parameters (using Nutch-1.2):
topN: 40,000
depth: 8
adddays: 30
fetcher.server.delay: 1
db.max.outlinks.per.page: 500

All parameters not mentioned have standard values as well as regex-urlfilter.txt.

Best Regards
Thomas


________________________________

GfK SE, Nuremberg, Germany, commercial register Nuremberg HRB 25014; Management Board: Professor Dr. Klaus L. Wübbenhorst (CEO), Pamela Knapp (CFO), Dr. Gerhard Hausruckinger, Petra Heinlein, Debra A. Pruent, Wilhelm R. Wessels; Chairman of the Supervisory Board: Dr. Arno Mahlert
This email and any attachments may contain confidential or privileged information. Please note that unauthorized copying, disclosure or distribution of the material in this email is not permitted.

Re: Parameter tuning or how to accelerate fetching

Posted by Markus Jelsma <ma...@openindex.io>.
This is most likely due to politeness. Check robots.txt.
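Politeness can come from the site as well as from Nutch's own settings: if a host's robots.txt specifies a Crawl-delay, Nutch honors it in preference to fetcher.server.delay. A hypothetical example of what to look for (host name and values are illustrative only):

```
# http://www.example.com/robots.txt
User-agent: *
Crawl-delay: 30
Disallow: /search/
```

A 30-second delay on one of only ten hosts caps that host at fewer than 3,000 pages per day, no matter what the fetcher settings say.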

> Dear List,
> 
> My process fetches only 10 but very big domains with millions of pages on
> each site. I now wonder why, after 2 weeks and 17 crawl-fetch cycles, I have
> only a handful of about 30,000 pages, and the count seems to be stagnating.
> 
> [...]

Re: AW: Parameter tuning or how to accelerate fetching

Posted by Julien Nioche <li...@gmail.com>.
or even without any restrictions from robots.txt: by default Nutch waits 5
seconds between fetches from the same host. If you have 100K URLs from a
single host, it will take at least 138 hours just to fetch them.
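Julien's figure is straightforward to check. A minimal sketch (the 5 seconds is Nutch's default fetcher.server.delay; requests to one host are assumed strictly serialized, and download time itself is ignored):

```python
def fetch_hours(urls_per_host: int, delay_seconds: float) -> float:
    """Lower bound on fetch time for one host under a fixed politeness delay."""
    return urls_per_host * delay_seconds / 3600.0

# 100K URLs at the default 5-second delay:
print(round(fetch_hours(100_000, 5.0), 1))  # -> 138.9 (hours), nearly 6 days

# Even Thomas's fetcher.server.delay of 1 second would still need ~28 hours:
print(round(fetch_hours(100_000, 1.0), 1))  # -> 27.8
```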

On 30 August 2011 11:23, Markus Jelsma <ma...@openindex.io> wrote:

> Your question was valid: why is my fetch so slow, and how can I accelerate
> it?
>
> Again, first check your robots.txt. With so few domains it's almost certain
> that politeness is the problem here.
>
> [...]



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: AW: Parameter tuning or how to accelerate fetching

Posted by Markus Jelsma <ma...@openindex.io>.
Your question was valid: why is my fetch so slow, and how can I accelerate it?

Again, first check your robots.txt. With so few domains it's almost certain 
that politeness is the problem here.

> Hi List,
> Hi Hannes,
> 
> All logs are free of errors and warnings. Injecting, updating, merging and
> indexing are not a problem and take only minutes. One cycle takes 2 days
> with my parameters. Regex-urlfilter.txt has been checked against the URL
> format of all sites.
> 
> But my apologies to the list, I may not have asked clearly. I'm mainly
> interested in why there is such a big difference between fetched and
> unfetched URLs, and what I can do to force fetching.
> 
> Please see my current readdb -stats output:
> TOTAL urls: 1698520
> [...]
> status 1 (db_unfetched): 1567047
> status 2 (db_fetched): 90399
> status 3 (db_gone): 11696
> status 4 (db_redir_temp): 4065
> status 5 (db_redir_perm): 10137
> status 6 (db_notmodified): 15176
> 
> The process has now been running for exactly 30 days. In the meantime I
> have 90,399 fetched pages instead of the 30,000 I had after 15 days. Is
> this normal?
> 
> Regards
> Thomas

AW: Parameter tuning or how to accelerate fetching

Posted by "Eggebrecht, Thomas (GfK Marktforschung)" <th...@gfk.com>.
Hi List,
Hi Hannes,

All logs are free of errors and warnings. Injecting, updating, merging and indexing are not a problem and take only minutes. One cycle takes 2 days with my parameters. Regex-urlfilter.txt has been checked against the URL format of all sites.

But my apologies to the list, I may not have asked clearly. I'm mainly interested in why there is such a big difference between fetched and unfetched URLs, and what I can do to force fetching.

Please see my current readdb -stats output:
TOTAL urls: 1698520
[...]
status 1 (db_unfetched): 1567047
status 2 (db_fetched): 90399
status 3 (db_gone): 11696
status 4 (db_redir_temp): 4065
status 5 (db_redir_perm): 10137
status 6 (db_notmodified): 15176

The process has now been running for exactly 30 days. In the meantime I have 90,399 fetched pages instead of the 30,000 I had after 15 days. Is this normal?
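Some rough arithmetic puts those numbers in perspective (a sketch; the ceiling assumes the 10 hosts are fetched in parallel with Nutch's default 5-second per-host delay, and ignores everything else):

```python
# Observed crawl rate from the readdb -stats output quoted above.
fetched, days = 90_399, 30
rate = fetched / days
print(f"{rate:.0f} pages/day")  # -> 3013 pages/day

# Theoretical politeness ceiling: 10 hosts, one request per 5 seconds each.
hosts, delay_s = 10, 5
ceiling = hosts * 86_400 // delay_s
print(f"{ceiling} pages/day ceiling")  # -> 172800 pages/day ceiling
```

The observed rate is under 2% of even the politeness-limited ceiling, which suggests something stronger than the default delay is in play, e.g. a Crawl-delay directive in robots.txt or the generator repeatedly selecting URLs that then fail or are filtered.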

Regards
Thomas

From: Hannes Carl Meyer [mailto:hannescarl@googlemail.com]
Sent: Tuesday, 30 August 2011 09:25
To: user@nutch.apache.org
Cc: Eggebrecht, Thomas (GfK Marktforschung)
Subject: Re: Parameter tuning or how to accelerate fetching

[...]

Re: Parameter tuning or how to accelerate fetching

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Hi Thomas,

first, 30,000 pages in two weeks is rather few...

Where did you get the total number of pages from? From the CrawlDB?
Please post the output of bin/nutch readdb crawldb/ -stats here.

How long does one cycle take?

If your regex-urlfilter.txt still has the standard settings, check your
websites for common query URLs like "index.php?param=value&param1...".
The standard regex-urlfilter is sometimes very strict in this case.
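The rule being referred to here: the stock regex-urlfilter.txt ships with a pattern along these lines, which silently drops any URL containing a query string:

```
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```

If the target sites use "index.php?id=..."-style URLs, commenting this rule out, or narrowing it to the characters you really want to exclude, lets those pages into the crawl.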

BR

Hannes

-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer

On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung) <
thomas.eggebrecht@gfk.com> wrote:

> Dear List,
>
> My process fetches only 10 but very big domains with millions of pages on
> each site. [...]

Re: Parameter tuning or how to accelerate fetching

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Thomas,

This seems like a perfect situation for running Nutch jobs on a Hadoop
cluster, if you have the resources. Given the length of your crawl (2 weeks)
and the number of recursive cycles, it is inherently hard for anyone, let
alone yourself, to provide accurate answers to this query. I would begin
with the logs... a generic search for FATAL, WARN or ERROR (the Commons
Logging levels) will return all instances, which may lead to some kind of
answers.
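A quick way to run that search, assuming the default Nutch 1.x log location logs/hadoop.log (demonstrated here on a synthetic log file; the log lines are illustrative, not real Nutch output):

```shell
# Create a small synthetic log to demonstrate the search.
printf '%s\n' \
  '2011-08-30 10:00:01 INFO  fetcher.Fetcher - fetching http://example.com/a' \
  '2011-08-30 10:00:06 WARN  fetcher.Fetcher - illustrative warning line' \
  '2011-08-30 10:00:07 ERROR http.Http - illustrative error line' > hadoop.log

# The actual search: count, then show, FATAL/ERROR/WARN entries.
grep -cE 'FATAL|ERROR|WARN' hadoop.log   # -> 2
grep -nE 'FATAL|ERROR|WARN' hadoop.log
```

On a real crawl, point grep at logs/hadoop.log in the Nutch runtime directory instead of the synthetic file.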

On Mon, Aug 29, 2011 at 4:33 PM, Eggebrecht, Thomas (GfK Marktforschung) <
thomas.eggebrecht@gfk.com> wrote:

> Dear List,
>
> My process fetches only 10 but very big domains with millions of pages on
> each site. [...]



-- 
Lewis