Posted to user@nutch.apache.org by Martin Aesch <ma...@googlemail.com> on 2013/07/19 16:30:46 UTC
Nutch 2.2.1 parse (slow?)
Dear nutchers,
My setup: Nutch 2.2.1 / HBase 0.90.6 / Hadoop 1.1.2 / 6 mappers / 6 reducers /
Core i7-3770 / 32 GB RAM (no swap) / 2x3 TB disks.
When I parse (in the mapper, with 6 map tasks running simultaneously), it is
very slow. Max load is ~1.5, max iowait is 5%, max CPU per task is only
30%, and max CPU for the HMaster is about 30%. Consequently, iotop also shows
low numbers.
Since parsing is a CPU-intensive job and all the I/O numbers are very low,
I wonder why parsing does not run faster and at full CPU usage. It really
takes a long time to finish. Where might the bottleneck be?
Thanks for any advice,
Martin
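For context, the 6 concurrent map tasks described above are capped per TaskTracker in a Hadoop 1.x pseudo-distributed setup via mapred-site.xml (a sketch; property names as in Hadoop 1.1.x, values illustrative):

```xml
<!-- mapred-site.xml: maximum concurrent map/reduce tasks per TaskTracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>6</value>
</property>
```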
Re: Nutch 2.2.1 parse (slow?)
Posted by Martin Aesch <ma...@googlemail.com>.
Hi Lewis,
I am in pseudo-distributed mode; everything is local. I have added some output
from my successful but slow parse job (nutch parse -resume): altogether 82M
URLs in the webpage table, around 8M in the batch as I said, of which one third
was unparsed. It took 24 hours to complete. Again: low I/O, low CPU usage.
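As a rough sanity check (a sketch using the counter values from the job output below and the 24-hour wall clock; numbers rounded):

```python
# Rough throughput estimate from the ParserJob counters reported below.
input_records = 82_277_147   # Map input records (full webpage table scan)
parsed_ok = 2_107_033        # ParserStatus success
wall_clock_s = 24 * 3600     # ~24 h reported runtime

scan_rate = input_records / wall_clock_s   # rows/s streamed out of HBase
parse_rate = parsed_ok / wall_clock_s      # successful parses/s

print(f"scan ~{scan_rate:.0f} rows/s, parse ~{parse_rate:.0f} URLs/s")
```

Under ~1000 rows/s for a full-table scan would be consistent with the job being starved on the HBase read path rather than CPU-bound, which would match the low load numbers above.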
The next trials were parse-resumes again, but since everything was already
parsed, no actual parsing took place ("Skipping..."). I deactivated the
firewall in case something was not bound to localhost; it made no difference.
I also tried gora.buffer.read.limit = 100000: no difference.
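For reference, the buffer mentioned above is a Gora property and can be overridden in conf/nutch-site.xml (value illustrative):

```xml
<!-- Number of rows Gora fetches per scanner round trip (default 10000) -->
<property>
  <name>gora.buffer.read.limit</name>
  <value>100000</value>
</property>
```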
I checked the preceding generate job (the webpage table was also 82M URLs in
total); it completed in 1 hour. I started a fresh generate job, which is
underway and on track for the same order of magnitude, about 1 hour.
As far as I can judge (which is obviously not very far), GeneratorMapper and
ParserMapper look quite similar in terms of computational effort.
On the other hand, I see that ParserJob adds more fields for
ParserMapper/GoraMapper to read, including the actual content.
(Regardless of "my" issue, does this mean that ParserMapper really reads every
piece of content in the webpage table?)
But as I said, my parse jobs were neither I/O-bound nor CPU-bound.
What else could I try?
Martin
--------------------------------------------------------------------------------
13/07/21 02:28:58 INFO mapred.JobClient: map 99% reduce 0%
13/07/21 02:33:32 INFO mapred.JobClient: map 100% reduce 0%
13/07/21 02:33:32 INFO mapred.JobClient: Job complete: job_201307121441_0022
13/07/21 02:33:32 INFO mapred.JobClient: Counters: 20
13/07/21 02:33:32 INFO mapred.JobClient: ParserStatus
13/07/21 02:33:32 INFO mapred.JobClient: failed=21314
13/07/21 02:33:32 INFO mapred.JobClient: success=2107033
13/07/21 02:33:32 INFO mapred.JobClient: Job Counters
13/07/21 02:33:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=521752464
13/07/21 02:33:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/21 02:33:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/21 02:33:32 INFO mapred.JobClient: Launched map tasks=223
13/07/21 02:33:32 INFO mapred.JobClient: Data-local map tasks=223
13/07/21 02:33:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/07/21 02:33:32 INFO mapred.JobClient: File Output Format Counters
13/07/21 02:33:32 INFO mapred.JobClient: Bytes Written=0
13/07/21 02:33:32 INFO mapred.JobClient: FileSystemCounters
13/07/21 02:33:32 INFO mapred.JobClient: HDFS_BYTES_READ=258415
13/07/21 02:33:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=17061620
13/07/21 02:33:32 INFO mapred.JobClient: File Input Format Counters
13/07/21 02:33:32 INFO mapred.JobClient: Bytes Read=0
13/07/21 02:33:32 INFO mapred.JobClient: Map-Reduce Framework
13/07/21 02:33:32 INFO mapred.JobClient: Map input records=82277147
13/07/21 02:33:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=69987168256
13/07/21 02:33:32 INFO mapred.JobClient: Spilled Records=0
13/07/21 02:33:32 INFO mapred.JobClient: CPU time spent (ms)=46213980
13/07/21 02:33:32 INFO mapred.JobClient: Total committed heap usage (bytes)=79857254400
13/07/21 02:33:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=373927632896
13/07/21 02:33:32 INFO mapred.JobClient: Map output records=2657067
13/07/21 02:33:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=258415
13/07/21 02:33:32 INFO parse.ParserJob: ParserJob: success
-----Original Message-----
From: Lewis John Mcgibbney <le...@gmail.com>
Reply-to: user@nutch.apache.org
To: user@nutch.apache.org <us...@nutch.apache.org>
Subject: Re: Nutch 2.2.1 parse (slow?)
Date: Sat, 20 Jul 2013 21:34:10 -0700
Hi Martin,
On Saturday, July 20, 2013, Martin Aesch <ma...@googlemail.com>
wrote:
> I have about 25K URLs per map task, and around 8M URLs total.
> All 6 mappers are running and continuously producing output. The
> aggregated parse rate is < 100 URLs/sec.
Wow, this is painstakingly slow indeed. This is similar to the problem
folks were reporting prior to the 2.2.1 release.
> What I did now is I replaced Neko with TagSoup in nutch-site.xml and
> resumed the parsing. As expected, I now see mostly "Skipping ... already
> parsed". The aggregated parse rate is the same, less than 100 URLs/sec.
> Load is now < 1 and the CPU is 95% idle. It looks as if the mapper tasks
> do not get enough input.
Wow... this is not the same as we were seeing before. Parsing is also
heavy on CPU... something is definitely fishy.
> Region server heap usage is "now" 4G out of 12G, with about 225 regions
> assigned. I am monitoring the system with Ganglia and did not see
> anything suspicious (being a Hadoop/HBase noob). I am about to increase
> gora.buffer.read.limit for a new test. On the other hand, the default of
> 10000 seems very reasonable to me.
Yes, it is a very reasonable default. Off topic: for injecting and some other
tasks I actually found that a lower value of 1000 for Gora writes (with the
Cassandra backend) gave a faster overall completion time.
Is the data all local, or do you have to send it over the network?
I am merely trying to see why URLs are being processed at such a low rate.
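The write-side counterpart of the read buffer Lewis mentions can be set the same way in conf/nutch-site.xml (a sketch; value as in his Cassandra experiment):

```xml
<!-- Number of buffered rows Gora accumulates before flushing a write
     (lowered per the Cassandra experience described above) -->
<property>
  <name>gora.buffer.write.limit</name>
  <value>1000</value>
</property>
```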
Re: Nutch 2.2.1 parse (slow?)
Posted by Martin Aesch <ma...@googlemail.com>.
Hi Lewis,
I have about 25K URLs per map task, and around 8M URLs total. All 6 mappers
are running and continuously producing output. The aggregated parse rate is
< 100 URLs/sec.
What I did now is I replaced Neko with TagSoup in nutch-site.xml and
resumed the parsing. As expected, I now see mostly "Skipping ... already
parsed". The aggregated parse rate is the same, less than 100 URLs/sec.
Load is now < 1 and the CPU is 95% idle. It looks as if the mapper tasks
do not get enough input.
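For reference, the Neko-to-TagSoup switch described above is controlled by a single property of the parse-html plugin in nutch-site.xml:

```xml
<!-- HTML parser implementation for the parse-html plugin:
     recognized keywords are "neko" (default) and "tagsoup" -->
<property>
  <name>parser.html.impl</name>
  <value>tagsoup</value>
</property>
```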
Region server heap usage is "now" 4G out of 12G, with about 225 regions
assigned. I am monitoring the system with Ganglia and did not see
anything suspicious (being a Hadoop/HBase noob). I am about to increase
gora.buffer.read.limit for a new test. On the other hand, the default of
10000 seems very reasonable to me.
Martin
On Fri, 2013-07-19 at 21:29 -0700, Lewis John Mcgibbney wrote:
> Hi Martin,
> Have you checked that all mappers are working while the parsing job is running?
> How many URLs are you trying to parse here?