Posted to dev@nutch.apache.org by "byron miller (JIRA)" <ji...@apache.org> on 2005/04/22 15:42:24 UTC
[jira] Created: (NUTCH-50) Benchmarks & Performance goals
Benchmarks & Performance goals
------------------------------
Key: NUTCH-50
URL: http://issues.apache.org/jira/browse/NUTCH-50
Project: Nutch
Type: Task
Components: searcher
Environment: Linux, Windows
Reporter: byron miller
I am interested in developing a strategy and toolset for benchmarking Nutch search. Please give your feedback on the following approaches, or your recommendations for setting standards and goals.
Example test case(s).
JDK 1.4.x 32 bit/Linux Platform
Single Node/2 gigs of memory
Single Index/Segment
1 million pages
-- single node --
JDK 1.4.x 32 bit/Linux Platform
Single Node/2 gigs of memory
Single Index/Segment
10 million pages
-- dual node --
JDK 1.4.2 32 bit/Linux Platform
2 Node/2 gigs of memory
2 Indexes/Segments (1 per node)
1 million pages
-- test queries --
* single term
* term AND term
* exact "small phrase"
* lang:en term
* term cluster
--- standards ----
10 results per page
---------------------
For me, a test case will help prove scalability, expose bottlenecks, and characterize application environments, settings and such. Given the amount of customization available, we really need to look at establishing the best baseline for X documents and some kind of scalability scale. For example, a 10-node system may only scale x percent better for x reasons, with x being the bottleneck in that scenario.
Test cases would serve multiple purposes: measuring performance, response time and application stability.
Tools/possibilities:
* JMX components
* http://grinder.sourceforge.net/
* JMeter
* others???
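Besides the off-the-shelf tools above, a small home-grown harness may be enough to get first numbers. The following is only a sketch (the query function below is a placeholder - in a real run it would issue an HTTP GET against the search front end) that times a fixed query set and reports mean and 95th-percentile latency:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

public class QueryBenchmark {
    /** Times one call of queryFn per query and returns latencies in milliseconds. */
    static List<Double> run(List<String> queries, Consumer<String> queryFn) {
        List<Double> latencies = new ArrayList<>();
        for (String q : queries) {
            long start = System.nanoTime();
            queryFn.accept(q); // placeholder: issue the real search request here
            latencies.add((System.nanoTime() - start) / 1e6);
        }
        return latencies;
    }

    /** Simple nearest-rank percentile over the recorded latencies. */
    static double percentile(List<Double> latencies, double p) {
        List<Double> sorted = new ArrayList<>(latencies);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }

    public static void main(String[] args) {
        List<String> queries = List.of("nutch", "apache AND lucene", "\"web crawler\"");
        List<Double> lat = run(queries, q -> { /* no-op stand-in for a real query */ });
        System.out.printf("mean=%.2fms p95=%.2fms%n",
                lat.stream().mapToDouble(Double::doubleValue).average().orElse(0),
                percentile(lat, 95));
    }
}
```

Reporting a high percentile alongside the mean matters for the parallel-request scenarios above, since averages hide the slow tail.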
---------------------
Query "stuffing" - use a dictionary that contains broad and vastly different terms. Something that could be scripted as a "warm up" for production systems as well. Possibly combine terms from our logs of common search queries to use as a benchmark?
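The "stuffing" idea above could be sketched roughly as follows (the dictionary contents and query shapes are illustrative; a real run would load the broad-term dictionary and log-derived terms from files). A fixed seed keeps benchmark runs repeatable:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class WarmupQueries {
    /** Builds a warm-up/benchmark set mixing single terms, AND pairs and
     *  exact phrases drawn from a dictionary of broad, different terms. */
    static List<String> build(List<String> dictionary, int count, long seed) {
        Random rnd = new Random(seed); // seeded for repeatable runs
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String a = dictionary.get(rnd.nextInt(dictionary.size()));
            String b = dictionary.get(rnd.nextInt(dictionary.size()));
            switch (i % 3) {
                case 0 -> queries.add(a);                          // single term
                case 1 -> queries.add(a + " AND " + b);            // term AND term
                default -> queries.add("\"" + a + " " + b + "\""); // exact phrase
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> dict = List.of("fish", "jazz", "kernel", "recipe", "volcano");
        build(dict, 6, 42L).forEach(System.out::println);
    }
}
```

Terms harvested from real query logs could simply be appended to the dictionary so the warm-up set reflects production traffic.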
What feedback/ideas do you have on building a good test case/stress-testing system/framework?
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
Re: [jira] Created: (NUTCH-50) Benchmarks & Performance goals
Posted by Michael Nebel <mi...@nebel.de>.
Hi Andrzej,
thanks for your corrections. I simply tried to express my observations.
> This is slightly incorrect. The summaries are only accessed for the
> first page of results, not for all hits. So, no matter how many hits
> there are, only the currently displayed page needs the summaries.
you're right! I forgot to add: 10 results per page, 1 result per site(!).
>> So I would suggest to use a static set of queries and an identical set
>> of segments to generate the numbers.
> If you repeat the same query twice, of course the results will come back
> faster, because the relevant data will be loaded into the OS disk cache.
OK - between two runs you'll have to take care of this point. But I
think that after 1000 different queries, the OS cache will be of no use
any more, won't it?
> Related to this, it is also better to use a single merged Lucene index
> than many per-segment indexes - the latter will work as well, but
> performance will be lower, and also there might be weird problems with
> scoring.
without the merged index my system would not answer at all!
But back to my numbers: besides the test itself, I used sar (from the
sysstat package under Linux) to measure the system parameters. I see
massive disk I/O on the data disk (not the swap partition!). I think
this might be one bottleneck, which we should look at more closely.
Regards
Michael
Re: [jira] Created: (NUTCH-50) Benchmarks & Performance goals
Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Nebel wrote:
> Hi Byron,
>
> for myself, I wanted to know how many connections my server can handle.
> So I made some tests myself. The observations were surprising but
> logical for me. I used some old desktop PCs under Linux (Pentium3-800
> MHz, ~256 MB, IDE disks), so I didn't have to generate much load to see
> the bottlenecks :-)
>
> The response time of Nutch is not only dependent on the number of
> parallel requests, but also on the number of responses Nutch returns
> (hit rate). The reason is simple: for each hit, the summary is loaded
> from disk. This causes a lot of disk I/O, which slowed my server down
> much more than the slow CPU and the low RAM.
This is slightly incorrect. The summaries are only accessed for the
first page of results, not for all hits. So, no matter how many hits
there are, only the currently displayed page needs the summaries.
>
> So I would suggest to use a static set of queries and an identical set
> of segments to generate the numbers.
If you repeat the same query twice, of course the results will come back
faster, because the relevant data will be loaded into the OS disk cache.
>
> An interesting number is the response time per hit and parallel request.
> I would expect that the size of the index has an influence on the number
> of hits returned, and an influence on how long it takes to locate the
> summary on disk.
Summaries are located in nearly constant time - each data file in a
segment is accompanied by a small "index" file (note: this is NOT the
Lucene index!), which is loaded entirely in memory. The time to seek in
the file to the right position is then limited in the worst case to the
time it takes to seek between consecutive positions in that index. IIRC
an index entry is created every 128 data entries.
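The bounded seek described above can be sketched like this (a simplification, not the actual Nutch/Hadoop MapFile code: the "data file" is just a sorted in-memory list, and the sparse index holds one position per 128 entries). A binary search over the small index narrows the scan to at most one interval:

```java
import java.util.ArrayList;
import java.util.List;

public class SparseIndexLookup {
    static final int INDEX_INTERVAL = 128; // one index entry per 128 data entries

    final List<String> data;                       // stands in for the sorted data file
    final List<Integer> index = new ArrayList<>(); // the small in-memory "index"

    SparseIndexLookup(List<String> sortedData) {
        this.data = sortedData;
        for (int i = 0; i < sortedData.size(); i += INDEX_INTERVAL) {
            index.add(i); // remember the position of every 128th key
        }
    }

    /** Binary-search the sparse index for the last indexed key <= key,
     *  then scan at most INDEX_INTERVAL entries - the bounded "seek". */
    int find(String key) {
        int lo = 0, hi = index.size() - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (data.get(index.get(mid)).compareTo(key) <= 0) {
                start = index.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        int end = Math.min(start + INDEX_INTERVAL, data.size());
        for (int i = start; i < end; i++) {
            if (data.get(i).equals(key)) return i;
        }
        return -1; // not found
    }
}
```

The worst case is therefore one in-memory binary search plus a scan of 128 entries, regardless of data-file size - which is why lookup time stays nearly constant, and why a corrupted index file silently degrades seeking to a much longer scan.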
>
> One question I still have is: how do the number and size of segments
> per search server influence the response time? What is better: many
> small segments or one big one? Looking at the servers I use for the
> tests - you can imagine my problem running this kind of test :-)
Random access to MapFile-s is more or less independent of the data file
size, as explained above. So, I think it's better to have one big
segment than many small ones. What is much more important is to make
sure that the "index" files (mentioned above) are not corrupted - in
such case everything will appear to work correctly, but seeking
performance will be terrible.
Related to this, it is also better to use a single merged Lucene index
than many per-segment indexes - the latter will work as well, but
performance will be lower, and also there might be weird problems with
scoring.
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: [jira] Created: (NUTCH-50) Benchmarks & Performance goals
Posted by Michael Nebel <mi...@nebel.de>.
Hi Byron,
for myself, I wanted to know how many connections my server can handle.
So I made some tests myself. The observations were surprising but
logical for me. I used some old desktop PCs under Linux (Pentium3-800
MHz, ~256 MB, IDE disks), so I didn't have to generate much load to see
the bottlenecks :-)
The response time of Nutch is not only dependent on the number of
parallel requests, but also on the number of responses Nutch returns
(hit rate). The reason is simple: for each hit, the summary is loaded
from disk. This causes a lot of disk I/O, which slowed my server down
much more than the slow CPU and the low RAM.
So I would suggest to use a static set of queries and an identical set
of segments to generate the numbers.
An interesting number is the response time per hit and parallel request.
I would expect that the size of the index has an influence on the number
of hits returned, and an influence on how long it takes to locate the
summary on disk.
One question I still have is: how do the number and size of segments
per search server influence the response time? What is better: many
small segments or one big one? Looking at the servers I use for the
tests - you can imagine my problem running this kind of test :-)
Regards
Michael
--
Michael Nebel
Internet: http://www.netluchs.de/