Posted to dev@nutch.apache.org by "byron miller (JIRA)" <ji...@apache.org> on 2005/04/22 15:42:24 UTC

[jira] Created: (NUTCH-50) Benchmarks & Performance goals

Benchmarks & Performance goals
------------------------------

         Key: NUTCH-50
         URL: http://issues.apache.org/jira/browse/NUTCH-50
     Project: Nutch
        Type: Task
  Components: searcher  
 Environment: Linux, Windows
    Reporter: byron miller


I am interested in developing a strategy and toolset for benchmarking Nutch search.  Please give your feedback on the following approaches, or recommendations for setting standards and goals.

Example test case(s).

-- single node --

JDK 1.4.x 32 bit/Linux Platform
Single Node/2 gigs of memory
Single Index/Segment
1 million pages

JDK 1.4.x 32 bit/Linux Platform
Single Node/2 gigs of memory
Single Index/Segment
10 million pages

-- dual node --

JDK 1.4.2 32 bit/Linux Platform
2 Nodes/2 gigs of memory
2 Indexes/Segments (1 per node)
1 million pages


-- test queries --

* single term
* term AND term
* exact "small phrase"
* lang:en term
* term cluster
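
A sketch of what such a static query set could look like in a benchmark script (the terms themselves are made up; only the category shapes come from the list above):

```python
# One hypothetical query per category listed above. A fixed set keeps
# benchmark runs comparable between configurations.
TEST_QUERIES = [
    "apache",                 # single term
    "apache AND lucene",      # term AND term
    '"open source"',          # exact "small phrase"
    "lang:en crawler",        # lang:en term
    "web search crawler",     # term cluster
]
```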

--- standards ----

10 results per page


---------------------

For me a test case will help prove scalability and expose bottlenecks, application environments, settings and such.  Given the number of customizations available, we really need to look at establishing the best baseline for X amount of documents, plus some kind of scalability scale.  For example, a 10-node system may only scale X percent better for X reasons, and X is the bottleneck in that scenario.

Test cases would serve multiple purposes: measuring performance, response time and application stability.

Tools/possibilities:

* JMX components
* http://grinder.sourceforge.net/
* JMeter
* others???
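
Alongside those tools, a tiny home-grown load generator is easy to sketch. This version takes the actual fetch step as a parameter, since the URL layout of a given Nutch frontend varies; the function name and structure here are illustrative only:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load(fetch, queries, parallel=4):
    """Issue every query via fetch(query) with the given parallelism.

    Returns a list of (query, seconds) pairs. 'fetch' is whatever
    performs the search - e.g. an HTTP GET against the search frontend.
    """
    def timed(query):
        start = time.perf_counter()
        fetch(query)
        return (query, time.perf_counter() - start)

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(timed, queries))
```

With urllib, fetch could be as simple as a lambda that GETs base_url plus the URL-quoted query, where base_url points at your search page (a hypothetical layout; adjust to your deployment).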

---------------------

Query "stuffing" - use a dictionary that contains broad and vastly different terms. Something that could be scripted as a "warm up" for production systems as well.  Possibly we could combine terms from our logs of common search queries to use as a benchmark?
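
The dictionary idea could be scripted roughly like this (a sketch; the word list would come from e.g. a system dictionary or our query logs, and the 50/50 mix of single and two-term queries is an arbitrary choice):

```python
import random

def make_warmup_queries(words, n, seed=None):
    """Build n warm-up queries from a word list, mixing single terms
    and two-term combinations so broadly different parts of the index
    get touched."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n):
        if rng.random() < 0.5:
            queries.append(rng.choice(words))
        else:
            queries.append(" ".join(rng.sample(words, 2)))
    return queries
```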

What is your feedback/ideas on building a good test case/stress testing system/framework?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Re: [jira] Created: (NUTCH-50) Benchmarks & Performance goals

Posted by Michael Nebel <mi...@nebel.de>.
Hi Andrzej,

thanks for your corrections. I simply tried to express my observations.

> This is slightly incorrect. The summaries are only accessed for the 
> first page of results, not for all hits. So, no matter how many hits 
> there are, only the currently displayed page needs the summaries.

you're right! I forgot to add: 10 results per page, 1 result per site(!).

>> So I would suggest to use a static set of queries and an identical set 
>> of segments to generate the numbers.
> If you repeat the same query twice, of course the results will come back 
> faster, because the relevant data will be loaded into the OS disk cache.

OK - between two runs you have to take care of this point. But I 
think that after 1000 different queries, the OS cache will be of no use 
any more, won't it?
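
One way to see how much the cache is helping is to time the same query set twice in a row and compare the passes (a sketch; 'search' stands for whatever issues a query against the server):

```python
import time

def cold_vs_warm(search, queries):
    """Time two consecutive passes over the same query set.

    If the second pass is much faster, the OS disk cache rather than
    the search code is dominating the numbers, so a benchmark should
    either discard the first pass or use distinct queries per run.
    """
    def one_pass():
        start = time.perf_counter()
        for query in queries:
            search(query)
        return time.perf_counter() - start

    return one_pass(), one_pass()
```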

> Related to this, it is also better to use a single merged Lucene index 
> than many per-segment indexes - the latter will work as well, but 
> performance will be lower, and also there might be weird problems with 
> scoring.

without the merged index my system would not answer at all!

But back to my numbers: besides the test itself, I used sar (from the 
sysstat package under Linux) to measure the system parameters. I see 
massive disk I/O on the data disk (not the swap partition!). I think 
this might be one bottleneck, which we should look at more closely.

Regards

	Michael







Re: [jira] Created: (NUTCH-50) Benchmarks & Performance goals

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Nebel wrote:
> Hi Byron,
> 
> for myself, I wanted to know how many connections my server can handle. 
> So I ran some tests myself. The observations were surprising but 
> logical to me. I used some old desktop PCs under Linux (Pentium3-800 
> MHz, ~256 MB, IDE disks), so I didn't need to generate much load to 
> see the bottlenecks :-)
> 
> The response time of Nutch depends not only on the number of 
> parallel requests, but also on the number of responses Nutch returns 
> (hit rate). The reason is simple: for each hit, the summary is loaded 
> from disk. This causes much disk I/O, which slowed my server down much 
> more than the slow CPU and the low RAM.

This is slightly incorrect. The summaries are only accessed for the 
first page of results, not for all hits. So, no matter how many hits 
there are, only the currently displayed page needs the summaries.

> 
> So I would suggest to use a static set of queries and an identical set 
> of segments to generate the numbers.

If you repeat the same query twice, of course the results will come back 
faster, because the relevant data will be loaded into the OS disk cache.

> 
> An interesting number is the response time per hit and parallel 
> request. I would expect that the size of the index has an influence 
> on the number of hits returned, and on how long it takes to locate 
> the summary on disk.

Summaries are located in nearly constant time - each data file in a 
segment is accompanied by a small "index" file (note: this is NOT the 
Lucene index!), which is loaded entirely into memory. The time to seek 
in the data file to the right position is then limited in the worst case 
to the time it takes to seek between two consecutive positions in that 
index. IIRC an index entry is created every 128 data entries.
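
The lookup scheme described above can be simulated in a few lines: a sparse index keeps every 128th key with its position, so a lookup is a binary search in memory plus a scan of at most 128 data entries (a toy model over a sorted list, not the actual MapFile code):

```python
import bisect

INDEX_INTERVAL = 128  # one index entry per 128 data entries, as described

def build_sparse_index(keys):
    """Record every 128th key together with its position in the data."""
    return [(keys[i], i) for i in range(0, len(keys), INDEX_INTERVAL)]

def lookup(keys, index, key):
    """Binary-search the in-memory index, then scan at most 128 entries."""
    index_keys = [k for k, _ in index]
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None  # key sorts before everything in the data
    start = index[i][1]
    for pos in range(start, min(start + INDEX_INTERVAL, len(keys))):
        if keys[pos] == key:
            return pos
    return None
```

Because only every 128th key sits in memory, the in-memory index stays small even for a very large data file, which matches the observation that summary lookup time is nearly independent of segment size.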

> 
> One question I still have is: how does the number and size of segments 
> per search server influence the response time? What is better: many 
> small segments or one big one? Looking at the servers I use for the 
> tests - you can imagine my problem running this kind of test :-)

Random access to MapFiles is more or less independent of the data file 
size, as explained above. So, I think it's better to have one big 
segment than many small ones. What is much more important is to make 
sure that the "index" files (mentioned above) are not corrupted - in 
such a case everything will appear to work correctly, but seeking 
performance will be terrible.

Related to this, it is also better to use a single merged Lucene index 
than many per-segment indexes - the latter will work as well, but 
performance will be lower, and also there might be weird problems with 
scoring.


-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: [jira] Created: (NUTCH-50) Benchmarks & Performance goals

Posted by Michael Nebel <mi...@nebel.de>.
Hi Byron,

for myself, I wanted to know how many connections my server can handle. 
So I ran some tests myself. The observations were surprising but 
logical to me. I used some old desktop PCs under Linux (Pentium3-800 
MHz, ~256 MB, IDE disks), so I didn't need to generate much load to 
see the bottlenecks :-)

The response time of Nutch depends not only on the number of 
parallel requests, but also on the number of responses Nutch returns 
(hit rate). The reason is simple: for each hit, the summary is loaded 
from disk. This causes much disk I/O, which slowed my server down much 
more than the slow CPU and the low RAM.

So I would suggest to use a static set of queries and an identical set 
of segments to generate the numbers.

An interesting number is the response time per hit and parallel 
request. I would expect that the size of the index has an influence 
on the number of hits returned, and on how long it takes to locate 
the summary on disk.

One question I still have is: how does the number and size of segments 
per search server influence the response time? What is better: many 
small segments or one big one? Looking at the servers I use for the 
tests - you can imagine my problem running this kind of test :-)

Regards

	Michael

-- 
Michael Nebel
Internet: http://www.netluchs.de/