Posted to dev@nutch.apache.org by Scott Gonyea <sc...@aitrus.org> on 2010/08/02 10:17:32 UTC

Seeking Insight into Nutch Configurations

Hi, I've been digging through Google and the archives quite thoroughly, to little avail. Please excuse any grammar mistakes; I just moved and lack Internet for my laptop.

The big problem that I am facing, thus far, occurs on the 4th fetch. All but one or two map tasks complete, and all of the running reduces stall (0.00 MB/s), presumably because they are waiting on those maps to finish? I really don't know, and it's frustrating.

I've been experimenting heavily with the numbers, but no matter how many maps/reduces I set in mapred-site, the outcome is the same.

I've created dozens of Hadoop AMIs with tweaks in the following ranges:
Memory assigned: 512m-2048m
Fetcher threads: 64-1024 (King of the DoS!)
Tracker Concurrent Maps: 1-32
Jobtracker Total Maps: 11 (1/node) to ~1091
Tracker Concurrent Reduces: 1-32
Jobtracker Total Reduces: 11 (1/node) to ~1091

There are more and I'll share some of my conf files once I'm able to do so. I would sincerely appreciate some insight into how to configure the various settings in Nutch/Hadoop.

My scenario:
# Sites: 10,000-30,000 per crawl
Depth: ~5
Content: Text is all that I care for. (HTML/RSS/XML)
Nodes: Amazon EC2 (ugh)
Storage: I've performed crawls with HDFS and with Amazon S3. I thought S3 would be more performant, but it doesn't appear to make a difference.
Cost vs Speed: I don't mind throwing EC2 instances at this to get it done quickly... But I can't imagine I need much more than 10-20 mid-size instances for this.

Can anyone share their own experiences in the performance they've seen?

Thank you very much,
Scott Gonyea

Re: Seeking Insight into Nutch Configurations

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-08-02 22:59, Scott Gonyea wrote:
> By the way, can anyone tell me if there is a way to explicitly limit how
> many pages should be fetched, per fetcher-task?

I believe that in the general case it would be a very complex problem to 
solve exactly. The reason is that Nutch doesn't use any global lock 
manager, so the only way to ensure proper per-host locking is to assign 
all URL-s from any given host to the same map task. This may (and often 
will) create an imbalance in the number of URL-s allocated per task.

One way to mitigate this imbalance is to set generate.max.count (in 
trunk; generate.max.per.host in 1.1) - this limits the number of URL-s 
from any given host to X, which helps mix these N per-host chunks more 
evenly across M map tasks.
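
For reference, a minimal sketch of how that might look in nutch-site.xml 
(the value 100 is purely illustrative, not a recommendation; in 1.1 use 
generate.max.per.host instead):

<property>
  <name>generate.max.count</name>
  <value>100</value>  <!-- illustrative cap on URLs per host per fetchlist -->
</property>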

> I think part of the problem is that, seemingly, Nutch seems to be
> generating some really unbalanced fetcher tasks.
>
> The task (task_201008021617_0026_m_000000) had 6859 pages to fetch.
>   Each higher-numbered task had fewer pages to fetch.  Task 000180 only
> had 44 pages to fetch.

There's no specific tool to examine the composition of fetchlist 
parts... try running this in the segments/2010*/crawl_generate/ directory:

for i in part-00*
do
	echo "---- part $i -----"
	strings $i | grep http://
done

to print the URL-s per map task. Most likely you will see that there was 
no other way to allocate the URLs per task while satisfying the 
constraint I explained above. If that's not the case, then it's a bug. :)
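
If you only want the counts, to see the imbalance at a glance, a small 
variation on the same idea (a sketch, assuming the same crawl_generate 
layout):

for i in part-00*
do
	echo "$i: $(strings $i | grep -c 'http://') URLs"
done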

>
> This *huge* imbalance, I think, creates tasks that are seemingly
> unpredictable.  All of my other resources just sit around, wasting
> resources, until one task grabs some crazy number of sites.

Again, generate.max.count is your friend - even though you won't be able 
to get all pages from a big site in one go, at least your crawls will 
finish quickly and you will quickly progress breadth-wise, if not 
depth-wise.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Seeking Insight into Nutch Configurations

Posted by AJ Chen <aj...@web2express.org>.
Scott, thank you for your effort. Fantastic. -aj

On Mon, Aug 2, 2010 at 3:34 PM, Scott Gonyea <sc...@aitrus.org> wrote:

> I have created a long stream of AMIs, and had planned to build one out that
> can be used as a cookie-cutter for everyone else.  Time isn't as long as it
> used to be (stupid physics).
>
> Besides scrubbing any client-sensitive residue from such an image, I'd need
> to figure out how to tell the slaves to point at the master node.  Right
> now, I use an elastic IP and I carve that into the AMI's hadoop
> installation.
>
> The AMI, by the way, is basically just your run-of-the-mill Hadoop cluster,
> with the various conf settings having been tuned to my
> trial-by-instance-hour-billing experiences.
>
> As it stands, if I did create such an AMI--I'd probably create with the
> assumption that everyone will happily install Rudy (
> http://github.com/solutious/rudy) and have it launch the master, pull out
> the IP information for the master node, and then slap that into the hadoop
> conf for each slave node.
>
> I don't follow the Nutch+Hadoop tutorial, in that I don't use the
> master/slave files.  Rather, the Hadoop slaves just talk to the Job Tracker
> and say, "hey, you can use me."  It has the benefit of allowing me to scale
> up, should the load be obscene.  However, because of how things balance
> out,
> I can't really scale down so easily.
>
> Actually, I really want to figure out how I can micro-manage Hadoop (as
> well
> as Nutch).  Nutch, because the segments need to be controlled.  Hadoop,
> because it would be far nicer to assign work based on a node's 'historical'
> performance... and less reliant on hard limits placed into mapred-site.
>
> I may or may not be babbling at this point.
>
> If anyone wants me to put some energy into that AMI, I'll take "votes of
> encouragement" in the form of sharing your non-sensitive (ie, domain names,
> company-sensitive data) conf files and what your performance results have
> been from them.  They can also be sent directly to me, for whatever reason
> you'd prefer to do that.
>
> I'll start looking into how best to create a non-suck AMI, but that'll be
> largely done in my "nobody thinks my time is worth money" hours of the day,
> which my wife sometimes interprets as "there's more work left in the
> man-slob."
>
> sg
>
> On Mon, Aug 2, 2010 at 3:08 PM, AJ Chen <aj...@web2express.org> wrote:
>
> > Does anyone have a EC2 image that runs smoothly for >3000 domains?  If a
> > sample of the complete nutch&hadoop configurations for distributed
> crawling
> > on EC2 is available for the community, that will help anyone to learn
> nutch
> > best practice quickly.
> > -aj
> >
> >
> > On Mon, Aug 2, 2010 at 1:59 PM, Scott Gonyea <sc...@aitrus.org> wrote:
> >
> > > By the way, can anyone tell me if there is a way to explicitly limit
> how
> > > many pages should be fetched, per fetcher-task?
> > >
> > > I know that one limit to this, is that each site/domain/whatever could
> > > exceed that limit (assuming the limit were lower than the number of
> those
> > > sites).  For politeness, that limit would have to be soft.  But that's
> > more
> > > than suitable, in my opinion.
> > >
> > > I think part of the problem is that, seemingly, Nutch seems to be
> > > generating
> > > some really unbalanced fetcher tasks.
> > >
> > > The task (task_201008021617_0026_m_000000) had 6859 pages to fetch.
>  Each
> > > higher-numbered task had fewer pages to fetch.  Task 000180 only had 44
> > > pages to fetch.
> > >
> > > This *huge* imbalance, I think, creates tasks that are seemingly
> > > unpredictable.  All of my other resources just sit around, wasting
> > > resources, until one task grabs some crazy number of sites.
> > >
> > > Thanks again,
> > > sg
> > >
> > > On Mon, Aug 2, 2010 at 11:57 AM, Scott Gonyea <me...@sgonyea.com> wrote:
> > >
> > > > Thank you very much, Andrzej.  I'm really hoping some people can share
> > > some
> > > > non-sensitive details of their setup.  I'm really curious about the
> > > > following:
> > > >
> > > > The ratio of Maps to Reduces for their nutch jobs?
> > > > The amount of memory that they allocate to each job task?
> > > > The number of simultaneous Maps/Reduces on any given host?
> > > > The number of fetcher threads they execute?
> > > >
> > > > Any config setup people can share would be great, so I can have a
> > > different
> > > > perspective on how people setup their nutch-site and mapred-site
> files.
> > > >
> > > > At the moment, I'm experimenting with the following configs:
> > > >
> > > > http://gist.github.com/505065
> > > >
> > > > I'm giving each task 2048m of memory.  Up to 5 Maps and 2 Reduces run
> > at
> > > > any given time.  I have Nutch firing off 181 Maps and 41 Reduces.
> >  Those
> > > are
> > > > both prime numbers, but I don't know if that really matters.  I've
> seen
> > > > Hadoop say that the number of reducers should be around the number of
> > > nodes
> > > > you have (the nearest prime).  I've seen, somewhere, some suggestions
> > > that
> > > > Nutch maps/reduces be anywhere from 1:0.93-1:1.25.  Does anyone have
> > > insight
> > > > to share on that?
> > > >
> > > > Thank you, Andrzej for the SIGQUIT suggestion.  I forgot about that.
> >  I'm
> > > > waiting for it to return to the 4th fetch step, so I can see why
> Nutch
> > > hates
> > > > me so much.
> > > >
> > > > sg
> > > >
> > > > On Mon, Aug 2, 2010 at 3:47 AM, Andrzej Bialecki <ab...@getopt.org>
> > wrote:
> > > >
> > > >> On 2010-08-02 10:17, Scott Gonyea wrote:
> > > >>
> > > >>> The big problem that I am facing, thus far, occurs on the 4th
> fetch.
> > > >>> All but 1 or 2 maps complete. All of the running reduces stall
> (0.00
> > > >>> MB/s), presumably because they are waiting on that map to finish? I
> > > >>> really don't know and it's frustrating.
> > > >>>
> > > >>
> > > >> Yes, all map tasks need to finish before reduce tasks are able to
> > > proceed.
> > > >> The reason is that each reduce task receives a portion of the
> keyspace
> > > (and
> > > >> values) according to the Partitioner, and in order to prepare a nice
> > > <key,
> > > >> list(value)> in your reducer it needs to, well, get all the values
> > under
> > > >> this key first, whichever map task produced the tuples, and then
> sort
> > > them.
> > > >>
> > > >> The failing tasks probably fail due to some other factor, and very
> > > likely
> > > >> (based on my experience) the failure is related to some particular
> > URLs.
> > > >> E.g. regex URL filtering can choke on some pathological URLs, like
> > URLs
> > > 20kB
> > > >> long, or containing '\0' etc, etc. In my experience, it's best to
> keep
> > > regex
> > > >> filtering to a minimum if you can, and use other urlfilters (prefix,
> > > domain,
> > > >> suffix, custom) to limit your crawling frontier. There are simply
> too
> > > many
> > > >> ways where a regex engine can lock up.
> > > >>
> > > >> Please check the logs of the failing tasks. If you see that a task
> is
> > > >> stalled you could also log in to the node, and generate a thread
> dump
> > a
> > > few
> > > >> times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the
> > > regex
> > > >> processing then it's likely this is your problem.
> > > >>
> > > >>
> > > >>  My scenario: # Sites: 10,000-30,000 per crawl Depth: ~5 Content:
> Text
> > > >>> is all that I care for. (HTML/RSS/XML) Nodes: Amazon EC2 (ugh)
> > > >>> Storage: I've performed crawls with HDFS and with amazon S3. I
> > > >>> thought S3 would be more performant, yet it doesn't appear to
> affect
> > > >>> matters. Cost vs Speed: I don't mind throwing EC2 instances at this
> > > >>> to get it done quickly... But I can't imagine I need much more than
> > > >>> 10-20 mid-size instances for this.
> > > >>>
> > > >>
> > > >> That's correct - with this number of unique sites the max.
> throughput
> > of
> > > >> your crawl will be ultimately limited by the politeness limits (# of
> > > >> requests/site/sec).
> > > >>
> > > >>
> > > >>
> > > >>> Can anyone share their own experiences in the performance they've
> > > >>> seen?
> > > >>>
> > > >>
> > > >> There is a very simple benchmark in trunk/ that you could use to
> > measure
> > > >> the raw performance (data processing throughput) of your EC2
> cluster.
> > > The
> > > >> real-life performance, though, will depend on many other factors,
> such
> > > as
> > > >> the number of unique sites, their individual speed, and (rarely) the
> > > total
> > > >> bandwidth at your end.
> > > >>
> > > >>
> > > >> --
> > > >> Best regards,
> > > >> Andrzej Bialecki     <><
> > > >>  ___. ___ ___ ___ _ _   __________________________________
> > > >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > >> http://www.sigram.com  Contact: info at sigram dot com
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: Seeking Insight into Nutch Configurations

Posted by Scott Gonyea <sc...@aitrus.org>.
I have created a long stream of AMIs, and had planned to build one out that
can be used as a cookie-cutter for everyone else.  Time isn't as long as it
used to be (stupid physics).

Besides scrubbing any client-sensitive residue from such an image, I'd need
to figure out how to tell the slaves to point at the master node.  Right
now, I use an Elastic IP and carve that into the AMI's Hadoop
installation.
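
Roughly, the bits I carve in look like this - a sketch only, with the
hostname and ports as placeholders rather than my exact conf:

<!-- core-site.xml on each slave -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://MASTER-ELASTIC-IP:9000</value>
</property>

<!-- mapred-site.xml on each slave -->
<property>
  <name>mapred.job.tracker</name>
  <value>MASTER-ELASTIC-IP:9001</value>
</property>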

The AMI, by the way, is basically just your run-of-the-mill Hadoop cluster,
with the various conf settings having been tuned to my
trial-by-instance-hour-billing experiences.

As it stands, if I did create such an AMI--I'd probably create it with the
assumption that everyone will happily install Rudy (
http://github.com/solutious/rudy) and have it launch the master, pull out
the IP information for the master node, and then slap that into the Hadoop
conf for each slave node.

I don't follow the Nutch+Hadoop tutorial, in that I don't use the
master/slave files.  Rather, the Hadoop slaves just talk to the Job Tracker
and say, "hey, you can use me."  It has the benefit of allowing me to scale
up, should the load be obscene.  However, because of how things balance out,
I can't really scale down so easily.

Actually, I really want to figure out how I can micro-manage Hadoop (as well
as Nutch).  Nutch, because the segments need to be controlled.  Hadoop,
because it would be far nicer to assign work based on a node's 'historical'
performance... and less reliant on hard limits placed into mapred-site.

I may or may not be babbling at this point.

If anyone wants me to put some energy into that AMI, I'll take "votes of
encouragement" in the form of you sharing your conf files (with anything
sensitive stripped out, e.g. domain names or company data) and what your
performance results have been from them.  They can also be sent directly
to me, for whatever reason you'd prefer to do that.

I'll start looking into how best to create a non-suck AMI, but that'll be
largely done in my "nobody thinks my time is worth money" hours of the day,
which my wife sometimes interprets as "there's more work left in the
man-slob."

sg

On Mon, Aug 2, 2010 at 3:08 PM, AJ Chen <aj...@web2express.org> wrote:

> Does anyone have a EC2 image that runs smoothly for >3000 domains?  If a
> sample of the complete nutch&hadoop configurations for distributed crawling
> on EC2 is available for the community, that will help anyone to learn nutch
> best practice quickly.
> -aj
>
>
> On Mon, Aug 2, 2010 at 1:59 PM, Scott Gonyea <sc...@aitrus.org> wrote:
>
> > By the way, can anyone tell me if there is a way to explicitly limit how
> > many pages should be fetched, per fetcher-task?
> >
> > I know that one limit to this, is that each site/domain/whatever could
> > exceed that limit (assuming the limit were lower than the number of those
> > sites).  For politeness, that limit would have to be soft.  But that's
> more
> > than suitable, in my opinion.
> >
> > I think part of the problem is that, seemingly, Nutch seems to be
> > generating
> > some really unbalanced fetcher tasks.
> >
> > The task (task_201008021617_0026_m_000000) had 6859 pages to fetch.  Each
> > higher-numbered task had fewer pages to fetch.  Task 000180 only had 44
> > pages to fetch.
> >
> > This *huge* imbalance, I think, creates tasks that are seemingly
> > unpredictable.  All of my other resources just sit around, wasting
> > resources, until one task grabs some crazy number of sites.
> >
> > Thanks again,
> > sg
> >
> > On Mon, Aug 2, 2010 at 11:57 AM, Scott Gonyea <me...@sgonyea.com> wrote:
> >
> > > Thank you very much, Andrzej.  I'm really hoping some people can share
> > some
> > > non-sensitive details of their setup.  I'm really curious about the
> > > following:
> > >
> > > The ratio of Maps to Reduces for their nutch jobs?
> > > The amount of memory that they allocate to each job task?
> > > The number of simultaneous Maps/Reduces on any given host?
> > > The number of fetcher threads they execute?
> > >
> > > Any config setup people can share would be great, so I can have a
> > different
> > > perspective on how people setup their nutch-site and mapred-site files.
> > >
> > > At the moment, I'm experimenting with the following configs:
> > >
> > > http://gist.github.com/505065
> > >
> > > I'm giving each task 2048m of memory.  Up to 5 Maps and 2 Reduces run
> at
> > > any given time.  I have Nutch firing off 181 Maps and 41 Reduces.
>  Those
> > are
> > > both prime numbers, but I don't know if that really matters.  I've seen
> > > Hadoop say that the number of reducers should be around the number of
> > nodes
> > > you have (the nearest prime).  I've seen, somewhere, some suggestions
> > that
> > > Nutch maps/reduces be anywhere from 1:0.93-1:1.25.  Does anyone have
> > insight
> > > to share on that?
> > >
> > > Thank you, Andrzej for the SIGQUIT suggestion.  I forgot about that.
>  I'm
> > > waiting for it to return to the 4th fetch step, so I can see why Nutch
> > hates
> > > me so much.
> > >
> > > sg
> > >
> > > On Mon, Aug 2, 2010 at 3:47 AM, Andrzej Bialecki <ab...@getopt.org>
> wrote:
> > >
> > >> On 2010-08-02 10:17, Scott Gonyea wrote:
> > >>
> > >>> The big problem that I am facing, thus far, occurs on the 4th fetch.
> > >>> All but 1 or 2 maps complete. All of the running reduces stall (0.00
> > >>> MB/s), presumably because they are waiting on that map to finish? I
> > >>> really don't know and it's frustrating.
> > >>>
> > >>
> > >> Yes, all map tasks need to finish before reduce tasks are able to
> > proceed.
> > >> The reason is that each reduce task receives a portion of the keyspace
> > (and
> > >> values) according to the Partitioner, and in order to prepare a nice
> > <key,
> > >> list(value)> in your reducer it needs to, well, get all the values
> under
> > >> this key first, whichever map task produced the tuples, and then sort
> > them.
> > >>
> > >> The failing tasks probably fail due to some other factor, and very
> > likely
> > >> (based on my experience) the failure is related to some particular
> URLs.
> > >> E.g. regex URL filtering can choke on some pathological URLs, like
> URLs
> > 20kB
> > >> long, or containing '\0' etc, etc. In my experience, it's best to keep
> > regex
> > >> filtering to a minimum if you can, and use other urlfilters (prefix,
> > domain,
> > >> suffix, custom) to limit your crawling frontier. There are simply too
> > many
> > >> ways where a regex engine can lock up.
> > >>
> > >> Please check the logs of the failing tasks. If you see that a task is
> > >> stalled you could also log in to the node, and generate a thread dump
> a
> > few
> > >> times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the
> > regex
> > >> processing then it's likely this is your problem.
> > >>
> > >>
> > >>  My scenario: # Sites: 10,000-30,000 per crawl Depth: ~5 Content: Text
> > >>> is all that I care for. (HTML/RSS/XML) Nodes: Amazon EC2 (ugh)
> > >>> Storage: I've performed crawls with HDFS and with amazon S3. I
> > >>> thought S3 would be more performant, yet it doesn't appear to affect
> > >>> matters. Cost vs Speed: I don't mind throwing EC2 instances at this
> > >>> to get it done quickly... But I can't imagine I need much more than
> > >>> 10-20 mid-size instances for this.
> > >>>
> > >>
> > >> That's correct - with this number of unique sites the max. throughput
> of
> > >> your crawl will be ultimately limited by the politeness limits (# of
> > >> requests/site/sec).
> > >>
> > >>
> > >>
> > >>> Can anyone share their own experiences in the performance they've
> > >>> seen?
> > >>>
> > >>
> > >> There is a very simple benchmark in trunk/ that you could use to
> measure
> > >> the raw performance (data processing throughput) of your EC2 cluster.
> > The
> > >> real-life performance, though, will depend on many other factors, such
> > as
> > >> the number of unique sites, their individual speed, and (rarely) the
> > total
> > >> bandwidth at your end.
> > >>
> > >>
> > >> --
> > >> Best regards,
> > >> Andrzej Bialecki     <><
> > >>  ___. ___ ___ ___ _ _   __________________________________
> > >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > >> http://www.sigram.com  Contact: info at sigram dot com
> > >>
> > >>
> > >
> >
>

Re: Seeking Insight into Nutch Configurations

Posted by AJ Chen <aj...@web2express.org>.
Does anyone have an EC2 image that runs smoothly for >3000 domains?  If a
sample of complete Nutch & Hadoop configurations for distributed crawling
on EC2 were available to the community, it would help anyone learn Nutch
best practices quickly.
-aj


On Mon, Aug 2, 2010 at 1:59 PM, Scott Gonyea <sc...@aitrus.org> wrote:

> By the way, can anyone tell me if there is a way to explicitly limit how
> many pages should be fetched, per fetcher-task?
>
> I know that one limit to this, is that each site/domain/whatever could
> exceed that limit (assuming the limit were lower than the number of those
> sites).  For politeness, that limit would have to be soft.  But that's more
> than suitable, in my opinion.
>
> I think part of the problem is that, seemingly, Nutch seems to be
> generating
> some really unbalanced fetcher tasks.
>
> The task (task_201008021617_0026_m_000000) had 6859 pages to fetch.  Each
> higher-numbered task had fewer pages to fetch.  Task 000180 only had 44
> pages to fetch.
>
> This *huge* imbalance, I think, creates tasks that are seemingly
> unpredictable.  All of my other resources just sit around, wasting
> resources, until one task grabs some crazy number of sites.
>
> Thanks again,
> sg
>
> On Mon, Aug 2, 2010 at 11:57 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>
> > Thank you very much, Andrzej.  I'm really hoping some people can share
> some
> > non-sensitive details of their setup.  I'm really curious about the
> > following:
> >
> > The ratio of Maps to Reduces for their nutch jobs?
> > The amount of memory that they allocate to each job task?
> > The number of simultaneous Maps/Reduces on any given host?
> > The number of fetcher threads they execute?
> >
> > Any config setup people can share would be great, so I can have a
> different
> > perspective on how people setup their nutch-site and mapred-site files.
> >
> > At the moment, I'm experimenting with the following configs:
> >
> > http://gist.github.com/505065
> >
> > I'm giving each task 2048m of memory.  Up to 5 Maps and 2 Reduces run at
> > any given time.  I have Nutch firing off 181 Maps and 41 Reduces.  Those
> are
> > both prime numbers, but I don't know if that really matters.  I've seen
> > Hadoop say that the number of reducers should be around the number of
> nodes
> > you have (the nearest prime).  I've seen, somewhere, some suggestions
> that
> > Nutch maps/reduces be anywhere from 1:0.93-1:1.25.  Does anyone have
> insight
> > to share on that?
> >
> > Thank you, Andrzej for the SIGQUIT suggestion.  I forgot about that.  I'm
> > waiting for it to return to the 4th fetch step, so I can see why Nutch
> hates
> > me so much.
> >
> > sg
> >
> > On Mon, Aug 2, 2010 at 3:47 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
> >
> >> On 2010-08-02 10:17, Scott Gonyea wrote:
> >>
> >>> The big problem that I am facing, thus far, occurs on the 4th fetch.
> >>> All but 1 or 2 maps complete. All of the running reduces stall (0.00
> >>> MB/s), presumably because they are waiting on that map to finish? I
> >>> really don't know and it's frustrating.
> >>>
> >>
> >> Yes, all map tasks need to finish before reduce tasks are able to
> proceed.
> >> The reason is that each reduce task receives a portion of the keyspace
> (and
> >> values) according to the Partitioner, and in order to prepare a nice
> <key,
> >> list(value)> in your reducer it needs to, well, get all the values under
> >> this key first, whichever map task produced the tuples, and then sort
> them.
> >>
> >> The failing tasks probably fail due to some other factor, and very
> likely
> >> (based on my experience) the failure is related to some particular URLs.
> >> E.g. regex URL filtering can choke on some pathological URLs, like URLs
> 20kB
> >> long, or containing '\0' etc, etc. In my experience, it's best to keep
> regex
> >> filtering to a minimum if you can, and use other urlfilters (prefix,
> domain,
> >> suffix, custom) to limit your crawling frontier. There are simply too
> many
> >> ways where a regex engine can lock up.
> >>
> >> Please check the logs of the failing tasks. If you see that a task is
> >> stalled you could also log in to the node, and generate a thread dump a
> few
> >> times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the
> regex
> >> processing then it's likely this is your problem.
> >>
> >>
> >>  My scenario: # Sites: 10,000-30,000 per crawl Depth: ~5 Content: Text
> >>> is all that I care for. (HTML/RSS/XML) Nodes: Amazon EC2 (ugh)
> >>> Storage: I've performed crawls with HDFS and with amazon S3. I
> >>> thought S3 would be more performant, yet it doesn't appear to affect
> >>> matters. Cost vs Speed: I don't mind throwing EC2 instances at this
> >>> to get it done quickly... But I can't imagine I need much more than
> >>> 10-20 mid-size instances for this.
> >>>
> >>
> >> That's correct - with this number of unique sites the max. throughput of
> >> your crawl will be ultimately limited by the politeness limits (# of
> >> requests/site/sec).
> >>
> >>
> >>
> >>> Can anyone share their own experiences in the performance they've
> >>> seen?
> >>>
> >>
> >> There is a very simple benchmark in trunk/ that you could use to measure
> >> the raw performance (data processing throughput) of your EC2 cluster.
> The
> >> real-life performance, though, will depend on many other factors, such
> as
> >> the number of unique sites, their individual speed, and (rarely) the
> total
> >> bandwidth at your end.
> >>
> >>
> >> --
> >> Best regards,
> >> Andrzej Bialecki     <><
> >>  ___. ___ ___ ___ _ _   __________________________________
> >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> http://www.sigram.com  Contact: info at sigram dot com
> >>
> >>
> >
>

Re: Seeking Insight into Nutch Configurations

Posted by Scott Gonyea <sc...@aitrus.org>.
By the way, can anyone tell me if there is a way to explicitly limit how
many pages should be fetched, per fetcher-task?

I know that one limit to this is that each site/domain/whatever could
exceed that limit (assuming the limit were lower than the number of those
sites).  For politeness, that limit would have to be soft.  But that's more
than suitable, in my opinion.

I think part of the problem is that Nutch seems to be generating some
really unbalanced fetcher tasks.

The task (task_201008021617_0026_m_000000) had 6859 pages to fetch.  Each
higher-numbered task had fewer pages to fetch.  Task 000180 only had 44
pages to fetch.

This *huge* imbalance, I think, creates tasks that are seemingly
unpredictable.  All of my other nodes just sit around, wasting resources,
until one task grabs some crazy number of sites.

Thanks again,
sg

On Mon, Aug 2, 2010 at 11:57 AM, Scott Gonyea <me...@sgonyea.com> wrote:

> Thank you very much, Andrzej.  I'm really hoping some people can share some
> non-sensitive details of their setup.  I'm really curious about the
> following:
>
> The ratio of Maps to Reduces for their nutch jobs?
> The amount of memory that they allocate to each job task?
> The number of simultaneous Maps/Reduces on any given host?
> The number of fetcher threads they execute?
>
> Any config setup people can share would be great, so I can have a different
> perspective on how people setup their nutch-site and mapred-site files.
>
> At the moment, I'm experimenting with the following configs:
>
> http://gist.github.com/505065
>
> I'm giving each task 2048m of memory.  Up to 5 Maps and 2 Reduces run at
> any given time.  I have Nutch firing off 181 Maps and 41 Reduces.  Those are
> both prime numbers, but I don't know if that really matters.  I've seen
> Hadoop say that the number of reducers should be around the number of nodes
> you have (the nearest prime).  I've seen, somewhere, some suggestions that
> Nutch maps/reduces be anywhere from 1:0.93-1:1.25.  Does anyone have insight
> to share on that?
>
> Thank you, Andrzej for the SIGQUIT suggestion.  I forgot about that.  I'm
> waiting for it to return to the 4th fetch step, so I can see why Nutch hates
> me so much.
>
> sg
>
> On Mon, Aug 2, 2010 at 3:47 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>> On 2010-08-02 10:17, Scott Gonyea wrote:
>>
>>> The big problem that I am facing, thus far, occurs on the 4th fetch.
>>> All but 1 or 2 maps complete. All of the running reduces stall (0.00
>>> MB/s), presumably because they are waiting on that map to finish? I
>>> really don't know and it's frustrating.
>>>
>>
>> Yes, all map tasks need to finish before reduce tasks are able to proceed.
>> The reason is that each reduce task receives a portion of the keyspace (and
>> values) according to the Partitioner, and in order to prepare a nice <key,
>> list(value)> in your reducer it needs to, well, get all the values under
>> this key first, whichever map task produced the tuples, and then sort them.
>>
>> The failing tasks probably fail due to some other factor, and very likely
>> (based on my experience) the failure is related to some particular URLs.
>> E.g. regex URL filtering can choke on some pathological URLs, like URLs 20kB
>> long, or containing '\0' etc, etc. In my experience, it's best to keep regex
>> filtering to a minimum if you can, and use other urlfilters (prefix, domain,
>> suffix, custom) to limit your crawling frontier. There are simply too many
>> ways where a regex engine can lock up.
>>
>> Please check the logs of the failing tasks. If you see that a task is
>> stalled you could also log in to the node, and generate a thread dump a few
>> times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the regex
>> processing then it's likely this is your problem.
>>
>>
>>  My scenario: # Sites: 10,000-30,000 per crawl Depth: ~5 Content: Text
>>> is all that I care for. (HTML/RSS/XML) Nodes: Amazon EC2 (ugh)
>>> Storage: I've performed crawls with HDFS and with amazon S3. I
>>> thought S3 would be more performant, yet it doesn't appear to affect
>>> matters. Cost vs Speed: I don't mind throwing EC2 instances at this
>>> to get it done quickly... But I can't imagine I need much more than
>>> 10-20 mid-size instances for this.
>>>
>>
>> That's correct - with this number of unique sites the max. throughput of
>> your crawl will be ultimately limited by the politeness limits (# of
>> requests/site/sec).
>>
>>
>>
>>> Can anyone share their own experiences in the performance they've
>>> seen?
>>>
>>
>> There is a very simple benchmark in trunk/ that you could use to measure
>> the raw performance (data processing throughput) of your EC2 cluster. The
>> real-life performance, though, will depend on many other factors, such as
>> the number of unique sites, their individual speed, and (rarely) the total
>> bandwidth at your end.
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>

Re: Seeking Insight into Nutch Configurations

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-08-02 20:57, Scott Gonyea wrote:
> Thank you very much, Adrzej.  I'm really hoping some people can share
> some non-sensitive details of their setup.  I'm really curious about the
> following:
>
> The ratio of Maps to Reduces for their nutch jobs?

This depends on the job and the amount of data. The more data, the more 
map tasks you will have. The number of reduce tasks is fixed, and it 
should be set so that the sort and reduce operations in each reduce task 
operate on a reasonably-sized chunk of output data. Hence the 
recommendation to set it to something between 1x-2x the number of nodes.

(BTW, the thing about primes is to avoid task scheduling issues, esp. in 
the presence of speculative execution... but that's another subject.)
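
As a concrete (hypothetical) example of that rule of thumb: on a 20-node 
cluster you might put something like the following in mapred-site.xml, or 
set it per job - any value between 1x and 2x your node count is in the 
right ballpark:

<property>
  <name>mapred.reduce.tasks</name>
  <value>23</value>  <!-- ~1x-2x the number of nodes; 23 is just an example -->
</property>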


> The amount of memory that they allocate to each job task?

Sufficient for the task ;) Both map and reduce operate on a fixed memory 
budget, using on-disk iterators - so you don't need to worry about the 
total number of records you want to process; just allocate enough memory 
to correctly process a tuple, with some room to spare for Hadoop task 
buffers, and optionally a bit more if you use a Combiner. All in all, I 
rarely see a good reason to go above 768MB, and often use less than that.

> The number of simultaneous Maps/Reduces on any given host?

Depends on the host - the amount of RAM/CPU. I usually use values ranging 
from 2 maps (low-end hardware) to 4 (regular hardware) to 8 (higher-end 
hardware), whatever low/regular/high means. Reduce tasks also include 
sorting, which is I/O intensive, so I usually allocate 1-2 per node.
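
A sketch of how those per-node limits are usually expressed in 
mapred-site.xml (the numbers are just the "regular hardware" case above, 
not a recommendation for any particular instance type):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>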

> The number of fetcher threads they execute?

Something between 10 and 100. With higher values you need a dedicated DNS 
cache - 100 threads all looking up IP-s take their toll...
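
The corresponding knob is fetcher.threads.fetch in nutch-site.xml; a 
sketch with an illustrative value:

<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>  <!-- illustrative; pair higher values with a local DNS cache -->
</property>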


> I'm giving each task 2048m of memory.

This is likely too much. Not that it really hurts if you have enough 
RAM... but JVMs may actually be less efficient if you give them 
unnecessarily huge heaps.
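
The per-task heap is controlled by mapred.child.java.opts in 
mapred-site.xml; a sketch using the more modest figure mentioned above 
(768MB is an example, not a hard rule):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx768m</value>
</property>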

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Seeking Insight into Nutch Configurations

Posted by Scott Gonyea <me...@sgonyea.com>.
Thank you very much, Andrzej.  I'm really hoping some people can share some
non-sensitive details of their setup.  I'm really curious about the
following:

The ratio of Maps to Reduces for their nutch jobs?
The amount of memory that they allocate to each job task?
The number of simultaneous Maps/Reduces on any given host?
The number of fetcher threads they execute?

Any config setup people can share would be great, so I can have a different
perspective on how people set up their nutch-site and mapred-site files.

At the moment, I'm experimenting with the following configs:

http://gist.github.com/505065

I'm giving each task 2048m of memory.  Up to 5 Maps and 2 Reduces run at any
given time.  I have Nutch firing off 181 Maps and 41 Reduces.  Those are
both prime numbers, but I don't know if that really matters.  I've seen
Hadoop say that the number of reducers should be around the number of nodes
you have (the nearest prime).  I've seen, somewhere, some suggestions that
Nutch maps/reduces be anywhere from 1:0.93-1:1.25.  Does anyone have insight
to share on that?

Thank you, Andrzej, for the SIGQUIT suggestion.  I forgot about that.  I'm
waiting for it to return to the 4th fetch step, so I can see why Nutch hates
me so much.

sg

On Mon, Aug 2, 2010 at 3:47 AM, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 2010-08-02 10:17, Scott Gonyea wrote:
>
>> The big problem that I am facing, thus far, occurs on the 4th fetch.
>> All but 1 or 2 maps complete. All of the running reduces stall (0.00
>> MB/s), presumably because they are waiting on that map to finish? I
>> really don't know and it's frustrating.
>>
>
> Yes, all map tasks need to finish before reduce tasks are able to proceed.
> The reason is that each reduce task receives a portion of the keyspace (and
> values) according to the Partitioner, and in order to prepare a nice <key,
> list(value)> in your reducer it needs to, well, get all the values under
> this key first, whichever map task produced the tuples, and then sort them.
>
> The failing tasks probably fail due to some other factor, and very likely
> (based on my experience) the failure is related to some particular URLs.
> E.g. regex URL filtering can choke on some pathological URLs, like URLs 20kB
> long, or containing '\0' etc, etc. In my experience, it's best to keep regex
> filtering to a minimum if you can, and use other urlfilters (prefix, domain,
> suffix, custom) to limit your crawling frontier. There are simply too many
> ways where a regex engine can lock up.
>
> Please check the logs of the failing tasks. If you see that a task is
> stalled you could also log in to the node, and generate a thread dump a few
> times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the regex
> processing then it's likely this is your problem.
>
>
>  My scenario: # Sites: 10,000-30,000 per crawl Depth: ~5 Content: Text
>> is all that I care for. (HTML/RSS/XML) Nodes: Amazon EC2 (ugh)
>> Storage: I've performed crawls with HDFS and with amazon S3. I
>> thought S3 would be more performant, yet it doesn't appear to affect
>> matters. Cost vs Speed: I don't mind throwing EC2 instances at this
>> to get it done quickly... But I can't imagine I need much more than
>> 10-20 mid-size instances for this.
>>
>
> That's correct - with this number of unique sites the max. throughput of
> your crawl will be ultimately limited by the politeness limits (# of
> requests/site/sec).
>
>
>
>> Can anyone share their own experiences in the performance they've
>> seen?
>>
>
> There is a very simple benchmark in trunk/ that you could use to measure
> the raw performance (data processing throughput) of your EC2 cluster. The
> real-life performance, though, will depend on many other factors, such as
> the number of unique sites, their individual speed, and (rarely) the total
> bandwidth at your end.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Seeking Insight into Nutch Configurations

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-08-02 10:17, Scott Gonyea wrote:
> The big problem that I am facing, thus far, occurs on the 4th fetch.
> All but 1 or 2 maps complete. All of the running reduces stall (0.00
> MB/s), presumably because they are waiting on that map to finish? I
> really don't know and it's frustrating.

Yes, all map tasks need to finish before reduce tasks are able to 
proceed. The reason is that each reduce task receives a portion of the 
keyspace (and values) according to the Partitioner, and in order to 
prepare a nice <key, list(value)> in your reducer it needs to, well, get 
all the values under this key first, whichever map task produced the 
tuples, and then sort them.

The failing tasks probably fail due to some other factor, and very 
likely (based on my experience) the failure is related to some 
particular URLs. E.g. regex URL filtering can choke on pathological 
URLs, like URLs 20kB long, or ones containing '\0', etc. In my 
experience, it's best to keep regex filtering to a minimum if you can, 
and use other urlfilters (prefix, domain, suffix, custom) to limit your 
crawling frontier. There are simply too many ways in which a regex 
engine can lock up.
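
For illustration, a hedged sketch of how one might lean on the 
prefix/domain filters instead of the regex filter via plugin.includes in 
nutch-site.xml - the value below is only a skeleton; keep whatever parse, 
index, and scoring plugins your install already lists:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(prefix|domain)|parse-(text|html)|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>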

Please check the logs of the failing tasks. If you see that a task is 
stalled you could also log in to the node, and generate a thread dump a 
few times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the 
regex processing then it's likely this is your problem.
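
Roughly, on the stalled node (the log location is the usual default and 
may differ in your setup):

# find the child task JVM, then send it SIGQUIT a few times, a few seconds apart
jps -l                # look for org.apache.hadoop.mapred.Child
kill -QUIT <pid>
# the thread dumps land in that task attempt's stdout log, e.g. under
# $HADOOP_HOME/logs/userlogs/<attempt-id>/stdout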

> My scenario: # Sites: 10,000-30,000 per crawl Depth: ~5 Content: Text
> is all that I care for. (HTML/RSS/XML) Nodes: Amazon EC2 (ugh)
> Storage: I've performed crawls with HDFS and with amazon S3. I
> thought S3 would be more performant, yet it doesn't appear to affect
> matters. Cost vs Speed: I don't mind throwing EC2 instances at this
> to get it done quickly... But I can't imagine I need much more than
> 10-20 mid-size instances for this.

That's correct - with this number of unique sites the max. throughput of 
your crawl will ultimately be limited by the politeness limits (# of 
requests/site/sec).
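
For reference, the politeness knobs in question - a nutch-site.xml 
sketch; the values shown are illustrative rather than tuning advice:

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>  <!-- seconds between successive requests to the same host -->
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>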

>
> Can anyone share their own experiences in the performance they've
> seen?

There is a very simple benchmark in trunk/ that you could use to measure 
the raw performance (data processing throughput) of your EC2 cluster. 
The real-life performance, though, will depend on many other factors, 
such as the number of unique sites, their individual speed, and (rarely) 
the total bandwidth at your end.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com