Posted to user@nutch.apache.org by kemical <mi...@gmail.com> on 2013/01/31 13:35:41 UTC

Very long time just before fetching and just after parsing

Hi, 

After my first URL injection (2000 URLs) I generated a first segment
with -topN 10000 and no depth option (does it default to 5 like for the crawl
command? I didn't see it in the doc).

Then I ran a first fetch/parse/update pass.

The end of parsing took a very, very long time (see below):
2013-01-31 03:26:26,648 INFO  parse.ParseSegment - ParseSegment: finished at 2013-01-31 03:26:26, elapsed: 29:09:35

A domainstats dump tells me I have 56393 fetched URLs and 517856 unfetched
ones.
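
For reference, a quick way to double-check those counts against the crawldb
itself (assuming the standard readdb tool is available in this setup):

# print crawldb status counts (db_fetched, db_unfetched, ...)
bin/nutch readdb crawl/crawldb -stats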

Then I tried to fetch a second segment with only -topN 1000, but the fetch was
stuck at this line:
2013-01-31 09:23:23,969 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
for more than 4 hours before I cancelled it.

Why are those steps taking so much time?

I'm using boilerpipe for parsing and set some metadata in my seed URLs, but
those are the only "exotic" things I think I have in my configuration.






Re: Very long time just before fetching and just after parsing

Posted by kemical <mi...@gmail.com>.
Hi,

I didn't manage to run the invertlinks and solrindex commands for only some
segments, since it seems those commands work only on the segments' parent dir.
So I made a small change to my fetch/parse/update/index loop.

*In short:*
I generate new segments into an empty "current_segments" dir. When the crawl
is done, I move the segments to the classic crawl/segments/ dir.


*My Code:*
# first pass: generate into the empty current_segments dir, then fetch/parse/update the newest segment
bin/nutch generate crawl/crawldb current_segments -topN 50000
s1=`ls -d current_segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1

# second pass, same steps
bin/nutch generate crawl/crawldb current_segments -topN 50000
s1=`ls -d current_segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1

# invert links and index only this run's segments
bin/nutch invertlinks crawl/linkdb -dir current_segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb current_segments/*

# once indexed, move the segments into the main segments dir
mv current_segments/* crawl/segments/
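
The two passes above are identical, so the same workflow could also be written
as a loop; a minimal sketch under the same assumptions (same paths and options
as above, the number of passes is arbitrary):

#!/bin/sh
# repeat the generate/fetch/parse/update pass, always picking the newest segment
for pass in 1 2; do
    bin/nutch generate crawl/crawldb current_segments -topN 50000
    s=`ls -d current_segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch parse $s
    bin/nutch updatedb crawl/crawldb $s
done

# invert links, index only this run's segments, then archive them
bin/nutch invertlinks crawl/linkdb -dir current_segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb current_segments/*
mv current_segments/* crawl/segments/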

*Conclusion / Question*
From my tests I haven't seen anything wrong with doing it this way. Since it's
not really the way described in the Nutch documentation, I'd rather have
confirmation from other users that there are no side effects.





Re: Very long time just before fetching and just after parsing

Posted by kemical <mi...@gmail.com>.
Hi and thanks Ferdy,

It seems that since I'm using -noFilter and -noNorm with "nutch generate
...", everything is going much more quickly (by the way, my version of Nutch
is 1.6).

Now I would like to optimize my crawling loop, since I don't want to reindex
everything with solrindex, and I also want to add only newly discovered links
to the linkdb.

Here is my loop content:

bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noNorm
s2=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2

bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noNorm
s3=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

I've read the doc about invertlinks and solrindex, but I still don't
understand how I can run invertlinks / solrindex only on the last segments
(here $s2 and $s3).

Could someone tell me how to set my command line to something like:
bin/nutch invertlinks crawl/linkdb -dir $s2 $s3
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s2 $s3
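
In other words, something like the following, assuming both commands accept
individual segment paths instead of a -dir argument (I still need to confirm
this against their usage output):

# invert links and index only the two newest segments
bin/nutch invertlinks crawl/linkdb $s2 $s3
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s2 $s3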

I already have about 1,000,000 indexed URLs and I don't really want to break
something by running the wrong tests.


My tool will be used for press coverage (finding new articles and storing them
for data reporting). So I'll need a quick loop, so that the site database
(currently 2000 URLs) always has all of its URLs indexed (it would be critical
to miss some important news just because the crawl takes too much time).







Re: Very long time just before fetching and just after parsing

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

Not sure if it's possible in the 2.x branch to filter/normalize just once,
but with a bit of hacking this should not be too difficult. If you filter
the input URLs (injected URLs), then you only need to filter the new URLs in
the parser and never again. (Of course, when you change the normalize/filter
rules you have to reprocess them all.)

Alternatively, what you could try is the patch in
https://issues.apache.org/jira/browse/NUTCH-1314 that limits URL lengths.
Usually a few URLs can stall the process for a long time because the regexes
(in the filter/normalizer) go crazy on them.

Best is to do both.
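
As a rough substitute for the patch, a rule near the top of
conf/regex-urlfilter.txt can drop overly long URLs before the heavier regexes
ever see them. The 500-character threshold below is only an example, and the
rule must appear before the final accept-all "+." line, since the first
matching rule wins:

# reject URLs longer than 500 characters (example threshold)
-^.{500,}$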


On Fri, Feb 1, 2013 at 9:04 AM, kemical <mi...@gmail.com> wrote:

> OK, now with generate and the -noFilter -noNorm options, the fetch starts
> almost immediately.
>
> I would really like to see an exhaustive picture of how URL
> filtering/normalizing is done across all the different steps of a crawl, to
> understand the side effects of what I'm doing.
>
> From what I've found, updatedb can also filter/normalize URLs, but it also
> normalizes crawldb URLs (which should take a very long time too). What I
> want (I think ^^) is to filter/normalize only newly discovered URLs, once.
> Is there a way to do that? Or am I completely wrong?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Very-long-time-just-before-fetching-and-just-after-parsing-tp4037673p4037886.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Ferdy Galema*
Kalooga Development




Re: Very long time just before fetching and just after parsing

Posted by kemical <mi...@gmail.com>.
OK, now with generate and the -noFilter -noNorm options, the fetch starts
almost immediately.

I would really like to see an exhaustive picture of how URL
filtering/normalizing is done across all the different steps of a crawl, to
understand the side effects of what I'm doing.

From what I've found, updatedb can also filter/normalize URLs, but it also
normalizes crawldb URLs (which should take a very long time too). What I want
(I think ^^) is to filter/normalize only newly discovered URLs, once. Is there
a way to do that? Or am I completely wrong?




Re: Very long time just before fetching and just after parsing

Posted by kemical <mi...@gmail.com>.
After testing: selecting the URLs to fetch from the unfetched URLs takes 15
hours (ouch!), while fetching 1000 URLs only takes a few minutes (same for
parsing).

I'm guessing one of these phases is taking a very long time:
2013-01-31 13:46:19,387 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2013-01-31 13:46:19,387 INFO  crawl.Generator - Generator: filtering: true
2013-01-31 13:46:19,387 INFO  crawl.Generator - Generator: normalizing: true

Does anyone know how to log each of those steps? Or have any clue about what
is happening?
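
In case it matters, here is what I was thinking of trying in
conf/log4j.properties to get more detail (a sketch, assuming the standard
Nutch 1.x class names; please correct me if these are wrong):

# conf/log4j.properties -- more verbose logging for the generate phase
log4j.logger.org.apache.nutch.crawl.Generator=DEBUG
log4j.logger.org.apache.nutch.net.URLNormalizers=DEBUG
log4j.logger.org.apache.nutch.net.URLFilters=DEBUG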




