Posted to user@nutch.apache.org by Magnús Skúlason <ma...@gmail.com> on 2012/02/19 14:53:23 UTC

ParseSegment taking a long time to finish

Hi,

According to my logs, a really long time (over 2 hours) elapses between
parsing the last page in a segment and the ParseSegment job finishing, as
can be seen here:

2012-02-19 00:51:43,471 INFO  parse.ParseSegment - Parsing: http:// ....
2012-02-19 03:15:18,604 INFO  parse.ParseSegment - ParseSegment:
finished at 2012-02-19 03:15:18, elapsed: 02:57:24

Since the total time of the parse job is just around 3 hours, this
represents a huge portion of the overall time.

Is it normal that the last step in the job takes such a long time, and
is there anything I can do to speed it up? I have been running the
generator with -topN 20000; I wouldn't have expected that to be a big
enough value to cause a problem. I have now reconfigured my script to
skip the -topN parameter to see what happens.

best regards,
Magnus

RE: ParseSegment taking a long time to finish

Posted by Markus Jelsma <ma...@openindex.io>.
Not so odd after all. I should have known it started the reducer at that time, silly me. The parse went perfectly fine in 42 minutes. The problem lies in your regex.

Cheers

 
 

RE: ParseSegment taking a long time to finish

Posted by sidbatra <si...@gmail.com>.
That's quite odd. Does the parser spawn multiple threads to optimize
parsing, and perhaps one of the threads hangs?

It's very odd that it's just waiting for the next record.

--
View this message in context: http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-tp3758053p3992624.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: ParseSegment taking a long time to finish

Posted by Markus Jelsma <ma...@openindex.io>.
I've modified the parser to log long-running records and ran your segment. There are quite a few records that run for more than a second on one machine with a 2x2.4GHz CPU. Unfortunately, it doesn't show me the record it's waiting for.

I output a record prior to parsing and after parsing with elapsed ms, but it's stalling somewhere. It should stall with a `Parsing: ` entry, not a `Parsed` one.

Parsing: http://www.target.com/p/somerset-5-drawer-chest-coffee/-/A-12121682
Parsed (34ms):http://www.target.com/p/somerset-5-drawer-chest-coffee/-/A-12121682
Parsing: http://www.target.com/p/tennessee-volunteers-college-party-pack-for-16-guests/-/A-14087806
Parsed (29ms):http://www.target.com/p/tennessee-volunteers-college-party-pack-for-16-guests/-/A-14087806
Parsing: http://www.target.com/p/the-board-dudes-magnetic-dry-erase-board-14-x14/-/A-13617619
Parsed (29ms):http://www.target.com/p/the-board-dudes-magnetic-dry-erase-board-14-x14/-/A-13617619
Parsing: http://www.target.com/p/the-laws-of-charisma-hardcover/-/A-12846523
Parsed (32ms):http://www.target.com/p/the-laws-of-charisma-hardcover/-/A-12846523
..STALLS

This is with a default Nutch checkout, but it could be a problem with me running it locally, although it shouldn't be.
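The instrumentation described above can be sketched roughly like this (a hypothetical stand-in, not the actual Nutch patch; `timed_parse`, `parse_fn`, and the slow-record threshold are assumptions):

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("parse")

def timed_parse(url, parse_fn, slow_ms=1000):
    """Log before and after parsing a record, flagging slow ones."""
    log.info("Parsing: %s", url)
    start = time.monotonic()
    result = parse_fn(url)  # stand-in for the real parser call
    elapsed_ms = int((time.monotonic() - start) * 1000)
    log.info("Parsed (%dms):%s", elapsed_ms, url)
    if elapsed_ms > slow_ms:
        log.warning("SLOW (%dms): %s", elapsed_ms, url)
    return result
```

A record that never returns from `parse_fn` leaves a `Parsing:` line with no matching `Parsed` line, which is exactly the pattern in the log excerpt above.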


RE: ParseSegment taking a long time to finish

Posted by sidbatra <si...@gmail.com>.
Thanks a lot Markus. I'll make these changes, re-run and share the result.


RE: ParseSegment taking a long time to finish

Posted by Markus Jelsma <ma...@openindex.io>.
Regex order matters. Happy to hear the results.
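As a sketch of why order matters: the regex URL filter applies its rules top to bottom and the first matching rule decides, so a cheap rule placed first shields pathological URLs from the expensive patterns below it. The rules here are illustrative stand-ins, not a recommended filter set:

```python
import re

# First-match-wins evaluation, in the style of regex-urlfilter.txt:
# each rule is ("+" accept / "-" reject, pattern), tried in order.
# Putting the cheap length check first means very long URLs are
# rejected before any expensive later pattern ever sees them.
RULES = [
    ("-", re.compile(r"^.{350,}$")),         # cheap length check first
    ("-", re.compile(r"\.(gif|jpg|png)$")),  # example: skip images
    ("+", re.compile(r".")),                 # default: accept the rest
]

def url_filter(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject

print(url_filter("http://example.com/index.html"))    # True
print(url_filter("http://example.com/logo.png"))      # False
print(url_filter("http://example.com/" + "x" * 400))  # False
```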


Considering your hardware you should parse this amount of pages in less than an hour. And you should decrease your mapper/reducer heap size significantly; it doesn't take 4 GB of RAM. A 1 GB mapper and a 500 MB reducer heap is safe enough. You can then allocate more task slots and get higher throughput.
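In Hadoop 1.x configuration terms this advice would look roughly like the sketch below (the values follow Markus's suggestion; the per-task-type heap properties and the slot counts are assumptions to adapt to your own Hadoop version and core count):

```xml
<!-- mapred-site.xml sketch: smaller per-task heaps, more task slots -->
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value>
</property>
```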

 
 

RE: ParseSegment taking a long time to finish

Posted by sidbatra <si...@gmail.com>.
> You already have that rule configured?

Yes, it's -^.{350,}$

> Is it one of the first simple expressions you have?

This is an excellent point. It's not the first one. I'll move it to first
place and see if it helps.

> How many records are you processing each time, is it roughly the same for
> all segments?

It's roughly the same for all segments - 300,000 URLs per segment. Each
segment finishes in 2.5 hours, but one segment took 10 hours.

> And are you running on Hadoop or pseudo or local?

Hadoop on Amazon EC2, one machine m2.2xlarge with 34.20 GB RAM and 13 (4
cores x 3.25 units) compute units.

mapred.child.java.opts	-Xmx4096m
mapred.tasktracker.map.tasks.maximum	6
mapred.tasktracker.reduce.tasks.maximum	2


RE: ParseSegment taking a long time to finish

Posted by Markus Jelsma <ma...@openindex.io>.
You already have that rule configured? Is it one of the first simple expressions you have? How many records are you processing each time, is it roughly the same for all segments? And are you running on Hadoop or pseudo or local?

 
 

RE: ParseSegment taking a long time to finish

Posted by sidbatra <si...@gmail.com>.
I'll run more experiments on that segment. My regex-urlfilter.txt removes
URLs longer than 350 chars:

-^.{350,}$

Any recommendations for a max URL char length? Or any other hypothesis that I
can test to confirm the problem?
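For clarity, a small sketch of what this single rule does (the leading "-" is the filter's reject marker; the regex itself just matches any URL of 350 or more characters):

```python
import re

# The rule "-^.{350,}$" excludes any URL of 350+ characters; the "-"
# prefix means "reject on match", and the pattern is an ordinary regex.
LONG_URL = re.compile(r"^.{350,}$")

def accepts(url):
    """Return True if the URL would survive this single rule."""
    return LONG_URL.match(url) is None

short_url = "http://example.com/page"
long_url = "http://example.com/" + "a" * 400

print(accepts(short_url))  # True: shorter than 350 chars, kept
print(accepts(long_url))   # False: 350+ chars, filtered out
```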


RE: ParseSegment taking a long time to finish

Posted by Markus Jelsma <ma...@openindex.io>.
Hi

The log output doesn't tell you what the task is actually doing; it is only Hadoop output and initialization of the URL filters. There should be no real problem with the parser job and URL filter programming in Nutch: we crawl large parts of the internet but the parser never stalls, at least not on URL filter processing. Anyway, check your CrawlDB; there could (or must) be very long URLs choking the regexes.

If your CrawlDB isn't too large you can dump it as CSV and grep for lines longer than 500 or 250 characters. You could also keep a backup of your CrawlDB and limit the URL length. You can also try parsing the bad segment with the same URL-length-limiting regex; it should solve the problem.
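The check suggested above could be sketched like this once you have a dump; the tab-separated layout with the URL as the first field, and the 250-character threshold, are assumptions about your dump format:

```python
# Scan CrawlDB dump lines (URL assumed to be the first, tab-separated
# field) and report entries over a length threshold.
def find_long_urls(lines, threshold=250):
    hits = []
    for line in lines:
        url = line.split("\t", 1)[0].strip()
        if len(url) > threshold:
            hits.append((len(url), url))
    return hits

sample = [
    "http://example.com/ok\tdb_fetched",
    "http://example.com/" + "q" * 300 + "\tdb_unfetched",
]
for length, url in find_long_urls(sample):
    print(length, url[:60] + "...")
```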

Cheers

 
 

Re: ParseSegment taking a long time to finish

Posted by sidbatra <si...@gmail.com>.
I have a recent example here from the logs during Parsing:

2012-06-30 23:46:55,763 INFO org.apache.hadoop.mapred.ReduceTask (main):
Merging 0 segments, 0 bytes from memory into reduce
2012-06-30 23:46:55,763 INFO org.apache.hadoop.mapred.Merger (main): Merging
4 sorted segments
2012-06-30 23:46:55,766 INFO org.apache.hadoop.mapred.Merger (main): Down to
the last merge-pass, with 4 segments left of total size: 960691756 bytes
2012-06-30 23:46:55,767 INFO org.apache.hadoop.conf.Configuration (main):
found resource regex-urlfilter.txt at
file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201206280129_0071/jars/regex-urlfilter.txt
2012-06-30 23:46:55,768 INFO org.apache.hadoop.conf.Configuration (main):
found resource regex-normalize.xml at
file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201206280129_0071/jars/regex-normalize.xml
2012-06-30 23:46:55,829 INFO
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer (main): can't
find rules for scope 'outlink', using default
2012-07-01 04:42:24,802 INFO
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer (main): can't
find rules for scope 'fetcher', using default
2012-07-01 08:48:03,952 INFO org.apache.hadoop.mapred.Task (main):
Task:attempt_201206280129_0071_r_000002_0 is done. And is in the process of
commiting
2012-07-01 08:48:12,698 INFO org.apache.hadoop.mapred.Task (main): Task
'attempt_201206280129_0071_r_000002_0' done.
2012-07-01 08:48:12,699 INFO org.apache.hadoop.mapred.TaskLogsTruncater
(main): Initializing logs' truncater with mapRetainSize=-1 and
reduceRetainSize=-1

It takes 4 hours after each of these messages:
RegexURLNormalizer (main): can't find rules for scope 'outlink', using
default
RegexURLNormalizer (main): can't find rules for scope 'fetcher', using
default


There is a recommendation somewhere on the mailing list to reduce the number
of inlinks and outlinks. 

What is odd is that this hanging issue isn't consistent with the
number of links being parsed. Other instances of parsing finish in 1/5 the
time.


Re: ParseSegment taking a long time to finish

Posted by mstekel <ms...@gmail.com>.
Hi guys. Did you find a solution for this issue?


Re: ParseSegment taking a long time to finish

Posted by sidbatra <si...@gmail.com>.
Hi Lewis, Magnus,

here is an example segment that consistently reproduces the slowness issue
in parsing - https://dl.dropbox.com/u/4027616/segment.tar.gz


I'd appreciate it if you guys have any insights.

I've documented more details here -
http://lucene.472066.n3.nabble.com/Nutch-Parse-Step-Bafflingly-Slow-in-Reduce-Step-with-example-td3988820.html

thanks,
Sid


Re: ParseSegment taking a long time to finish

Posted by Magnús Skúlason <ma...@gmail.com>.
Hi,

No, I didn't really find a good solution, but if I remember correctly I
deleted the crawl database. I have noticed that those jobs seem to take
longer and longer, which is expected of course, since the crawl database
grows every time.

I also set up a Hadoop cluster and that helped a lot in increasing the
performance.

But I haven't been following my crawl process thoroughly lately so
maybe the problem is still hanging around.

best regards,
Magnus


Re: ParseSegment taking a long time to finish

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Can someone provide a URL please?




-- 
Lewis

Re: ParseSegment taking a long time to finish

Posted by sidbatra <si...@gmail.com>.
Hi Magnus,

I'm facing exactly the same issue with Nutch 1.4.

Did you manage to find a solution?

thanks,
Sid


Re: ParseSegment taking a long time to finish

Posted by Magnús Skúlason <ma...@gmail.com>.
Hi,

I tried the parsechecker tool and as it turns out it hangs after printing out:
Content Metadata: Vary=Accept-Encoding Date=Thu, 23 Feb 2012 15:27:43
GMT Content-Length=3992 Expires=Thu, 19 Nov 1981 08:52:00 GMT
Content-Encoding=gzip
Set-Cookie=Shoper4Shop=a3ojqpk5ep6opahejfpiv98hf6; path=/
Content-Type=text/html Connection=close X-Powered-By=PHP/5.2.17
Server=Apache Pragma=no-cache Cache-Control=no-store, no-cache,
must-revalidate, post-check=0, pre-check=0
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

but it does not give me a specific error or anything like that. Is there
some way that I can turn that on, i.e. which Java class do I want to
increase the log level for?

I also found a similar issue on some URLs from another host. Is there
any way to defend against this, i.e. setting a max timeout parameter
on the parser threads or anything like that? It seems to be a tedious
process to filter out the problematic URLs by hand.
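A generic way to bound a hanging parse, as asked above, is to run each record on a worker thread and give up after a timeout. This is only a sketch, not a Nutch feature; note that the stuck worker thread itself keeps running, so the offending URL should still be logged and filtered out of future crawls:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def parse_with_timeout(parse_fn, url, timeout_s=30.0):
    """Run parse_fn(url) on a worker thread; return None on timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(parse_fn, url)
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return None  # caller should record this URL and skip it
    finally:
        # Don't block on a hung worker; the thread itself still lingers.
        pool.shutdown(wait=False)
```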

best regards,
Magnus


Re: ParseSegment taking a long time to finish

Posted by remi tassing <ta...@gmail.com>.
Hi,

Could you also try the parsechecker tool on that last URL? It's
possible that the file has a problem, or it's simply a bug.

Remi
