Posted to user@nutch.apache.org by Martin Aesch <ma...@googlemail.com> on 2013/07/19 16:30:46 UTC

Nutch 2.2.1 parse (slow?)

Dear nutchers, 

My setup: Nutch 2.2.1 / HBase 0.90.6 / Hadoop 1.1.2 / 6 mappers /
6 reducers / Core i7-3770 / 32 GB RAM (no swap) / 2x3 TB disks.

When I parse (in the mapper, with 6 map tasks running simultaneously),
it is very slow. Max load is ~1.5, max iowait is 5%, max CPU per task
is only 30%, and max CPU for the HMaster is about 30%. Consequently,
iotop also shows low numbers.

Since parsing is a CPU-intensive job and all the I/O numbers are very
low, I wonder why parsing does not run faster and with full CPU usage.
It really takes a long time to finish. Where might the bottleneck be?

Thanks for any advice,
Martin

 



Re: Nutch 2.2.1 parse (slow?)

Posted by Martin Aesch <ma...@googlemail.com>.
Hi Lewis,

I am in pseudo-distributed mode; everything is local. I have added some
output from my successful but slow parse job (nutch parse -resume):
altogether 82M URLs in the webpage table, around 8M in the batch as I
said, of which one third was unparsed. It took 24 hours to complete.
Again: low I/O, low CPU usage.
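
For reference, the resume invocation in Nutch 2.x looks roughly like
this (ParserJob also accepts -all instead of a batch id, and -force to
re-parse already-parsed rows):

  # resume parsing; rows already marked as parsed in this batch are skipped
  bin/nutch parse <batchId> -resume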

The next trials were again parse resumes, but since everything was
already parsed, no actual parsing took place ("Skipping..."). I
deactivated the firewall in case something was not bound to localhost;
that made no difference. I tried gora.buffer.read.limit = 100000, also
no difference.
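
The override went into nutch-site.xml roughly like this (a sketch; as I
understand it, the property bounds how many rows Gora buffers per read
from the backing datastore):

  <property>
    <name>gora.buffer.read.limit</name>
    <value>100000</value>
    <description>Maximum number of rows Gora buffers per read from
    the backing datastore (default 10000).</description>
  </property>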

I checked the preceding generate job (the webpage table also held 82M
URLs at the time); it completed in 1 hour. I have started a fresh
generate job, which is under way and looks set to finish in the same
order of magnitude, around 1 hour.

Both GeneratorMapper and ParserMapper look very similar in terms of
computational effort, as far as I can judge (which is admittedly not
very far). On the other hand, I see that ParserJob registers more
fields for the ParserMapper/GoraMapper to read, including the actual
content. (Regardless of "my" issue, this means that ParserMapper really
reads every piece of content in the webpage table, right?) But as I
said, my parse jobs were neither I/O-bound nor CPU-bound.
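
To illustrate what I mean, a Gora-based job declares the columns its
mapper will read along these lines (a sketch, not a verbatim copy of
ParserJob; the exact field list may differ, the point is just that
CONTENT is included):

  import java.util.Collection;
  import java.util.HashSet;
  import org.apache.nutch.storage.WebPage;

  public class ParserFieldsSketch {
    // The job hands this collection to GoraMapper, and the datastore
    // then materializes only these columns for every row it scans.
    static final Collection<WebPage.Field> FIELDS =
        new HashSet<WebPage.Field>();
    static {
      FIELDS.add(WebPage.Field.STATUS);
      FIELDS.add(WebPage.Field.CONTENT);      // the full fetched content
      FIELDS.add(WebPage.Field.CONTENT_TYPE);
      FIELDS.add(WebPage.Field.MARKERS);
    }
  }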

What else could I try?


Martin

--------------------------------------------------------------------------------
13/07/21 02:28:58 INFO mapred.JobClient:  map 99% reduce 0%
13/07/21 02:33:32 INFO mapred.JobClient:  map 100% reduce 0%
13/07/21 02:33:32 INFO mapred.JobClient: Job complete: job_201307121441_0022
13/07/21 02:33:32 INFO mapred.JobClient: Counters: 20
13/07/21 02:33:32 INFO mapred.JobClient:   ParserStatus
13/07/21 02:33:32 INFO mapred.JobClient:     failed=21314
13/07/21 02:33:32 INFO mapred.JobClient:     success=2107033
13/07/21 02:33:32 INFO mapred.JobClient:   Job Counters
13/07/21 02:33:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=521752464
13/07/21 02:33:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/21 02:33:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/21 02:33:32 INFO mapred.JobClient:     Launched map tasks=223
13/07/21 02:33:32 INFO mapred.JobClient:     Data-local map tasks=223
13/07/21 02:33:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/07/21 02:33:32 INFO mapred.JobClient:   File Output Format Counters
13/07/21 02:33:32 INFO mapred.JobClient:     Bytes Written=0
13/07/21 02:33:32 INFO mapred.JobClient:   FileSystemCounters
13/07/21 02:33:32 INFO mapred.JobClient:     HDFS_BYTES_READ=258415
13/07/21 02:33:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=17061620
13/07/21 02:33:32 INFO mapred.JobClient:   File Input Format Counters
13/07/21 02:33:32 INFO mapred.JobClient:     Bytes Read=0
13/07/21 02:33:32 INFO mapred.JobClient:   Map-Reduce Framework
13/07/21 02:33:32 INFO mapred.JobClient:     Map input records=82277147
13/07/21 02:33:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=69987168256
13/07/21 02:33:32 INFO mapred.JobClient:     Spilled Records=0
13/07/21 02:33:32 INFO mapred.JobClient:     CPU time spent (ms)=46213980
13/07/21 02:33:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=79857254400
13/07/21 02:33:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=373927632896
13/07/21 02:33:32 INFO mapred.JobClient:     Map output records=2657067
13/07/21 02:33:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=258415
13/07/21 02:33:32 INFO parse.ParserJob: ParserJob: success
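
A quick back-of-the-envelope from the counters above (my own
arithmetic, assuming all 6 map slots were busy for the whole job):

  wall clock:  SLOTS_MILLIS_MAPS / 6 = 521752464 ms / 6 ~ 86959 s ~ 24.2 h
  scan rate:   82277147 input records / 86959 s ~ 946 rows/s
  parse rate:  (2107033 + 21314) pages / 86959 s ~ 24 pages/s
  CPU usage:   46213980 ms CPU / 521752464 ms slot time ~ 9%

So the mappers spent roughly 90% of their slot time waiting, which fits
the impression that the job is bound by the HBase scan rather than by
parsing.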



Re: Nutch 2.2.1 parse (slow?)

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Martin,

On Saturday, July 20, 2013, Martin Aesch <ma...@googlemail.com>
wrote:
> I have about 25K URLs per map task and around 8M URLs in total.
> All 6 mappers are running and producing output continuously. The
> aggregated parse rate is < 100 URLs/sec.

Wow, this is painstakingly slow indeed. This is similar to the problem
folks were reporting prior to the 2.2.1 release.

> What I did now is replace neko with tagsoup in nutch-site.xml and
> resume the parsing. As expected, I now mostly see "Skipping ...
> already parsed". The aggregated parse rate is the same, less than
> 100 URLs/sec. Load is now < 1 and the CPU is 95% idle. It somehow
> looks as if the mapper tasks do not get enough input.

Wow... this is not the same as what we were seeing before. Parsing is
also heavy on CPU... something is definitely fishy.

> Region server heap usage is currently 4G out of 12G, with about 225
> regions assigned. I am monitoring the system with Ganglia and did not
> see anything suspicious (being a Hadoop/HBase noob). I am about to
> increase gora.buffer.read.limit for a new test. On the other hand,
> the default of 10000 seems very reasonable to me.

Yes, it is a very reasonable default. Off topic: for injecting and some
other tasks I actually found that a lower value of 1000 for Gora writes
(with the Cassandra backend) gave a faster overall completion time.
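
For anyone trying the same: assuming the write-side property name is
gora.buffer.write.limit (analogous to the read limit; please check it
against your Gora version), the override in nutch-site.xml would look
roughly like this:

  <property>
    <name>gora.buffer.write.limit</name>
    <value>1000</value>
    <description>Number of records Gora buffers before flushing a
    batch of writes to the backing datastore.</description>
  </property>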

Is the data all local, or do you have to send it over the network?
I am merely trying to see why URLs are being processed at such a low rate.

-- 
*Lewis*

Re: Nutch 2.2.1 parse (slow?)

Posted by Martin Aesch <ma...@googlemail.com>.
Hi Lewis,

I have about 25K URLs per map task and around 8M URLs in total.
All 6 mappers are running and producing output continuously. The
aggregated parse rate is < 100 URLs/sec.

What I did now is replace neko with tagsoup in nutch-site.xml and
resume the parsing. As expected, I now mostly see "Skipping ...
already parsed". The aggregated parse rate is the same, less than
100 URLs/sec. Load is now < 1 and the CPU is 95% idle. It somehow
looks as if the mapper tasks do not get enough input.
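
For reference, the switch is just this property in nutch-site.xml
(sketch; the recognized values are "neko" and "tagsoup"):

  <property>
    <name>parser.html.impl</name>
    <value>tagsoup</value>
    <description>HTML parser implementation to use: neko or tagsoup.</description>
  </property>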

Region server heap usage is currently 4G out of 12G, with about 225
regions assigned. I am monitoring the system with Ganglia and did not
see anything suspicious (being a Hadoop/HBase noob). I am about to
increase gora.buffer.read.limit for a new test. On the other hand,
the default of 10000 seems very reasonable to me.

Martin



Re: Nutch 2.2.1 parse (slow?)

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Martin,
Have you checked that all mappers are working while the parse job is running?
How many URLs are you trying to parse here?


-- 
*Lewis*