Posted to user@nutch.apache.org by wuqi <ch...@gmail.com> on 2008/06/04 04:46:50 UTC

document segment size and search performance?

Hi,
As we all know, "parse_text" in the segment will be used by searcher to generate snippets,and I want to know with the two conditions below which should be faster for searcher to retrieve pars_text:
1. 50 Segments * 10,000 pages/segment
2. 5 segment * 100,000 pages/segment 
If we have more segments and less pages per segment ,seems we need to open more segment files,and hence more memory? If more pages in a segment,we might need more time to get certain page out? Find a page from 10,000 pages should be faster than 100,000 pages ?
For a search engine which have about 10M documents, how many segments dir should I have ?

Thanks
-Qi

 

Re: document segment size and search performance?

Posted by wuqi <ch...@gmail.com>.
Thanks, Andrzej, for such a detailed answer!!


Re: document segment size and search performance?

Posted by Andrzej Bialecki <ab...@getopt.org>.
wuqi wrote:
> Hi,
> As we all know, the "parse_text" data in a segment is used by the searcher to generate snippets, and I want to know which of the two layouts below would be faster for the searcher when retrieving parse_text:
> 1. 50 segments * 10,000 pages/segment
> 2. 5 segments * 100,000 pages/segment

parse_text uses Hadoop MapFiles. MapFiles provide fast random access 
to individual records because they contain an index of keys (and the 
files themselves are sorted in ascending order of keys). This index 
(which contains every 128th key and its position) is fully loaded into 
memory, and when you want to get a particular record, this index is 
first searched (using binary search) to determine the correct "region" 
of the MapFile, and then that region is read from disk and scanned for 
the key.
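
As a concrete illustration, here is a minimal sketch of such a lookup 
through the Hadoop API (the segment path and URL are made up for the 
example; in Nutch the parse_text MapFile is keyed by URL, with 
ParseText values):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.parse.ParseText;

    public class ParseTextLookup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Open one part of a segment's parse_text MapFile
        // (the directory name here is hypothetical).
        MapFile.Reader reader = new MapFile.Reader(fs,
            "crawl/segments/20080604123456/parse_text/part-00000", conf);
        Text key = new Text("http://www.example.com/");
        ParseText value = new ParseText();
        // get() binary-searches the in-memory key index, seeks to the
        // matching region of the data file, and scans it for the key.
        if (reader.get(key, value) != null) {
          System.out.println(value.getText());
        }
        reader.close();
      }
    }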

This means that extremely large MapFiles may consume a lot of memory 
(though this can be adjusted by changing the index interval).
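
The interval is a write-time setting; a sketch of raising it (the 
path, value and key/value classes are chosen only for illustration - 
depending on the Hadoop version the io.map.index.interval property can 
also set it globally):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.parse.ParseText;

    public class SparseIndexWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Output directory is hypothetical.
        MapFile.Writer writer = new MapFile.Writer(conf, fs,
            "tmp/parse_text_sparse", Text.class, ParseText.class);
        // Index every 1024th key instead of the default 128th; the
        // in-memory index shrinks ~8x at the cost of longer region scans.
        writer.setIndexInterval(1024);
        writer.append(new Text("http://www.example.com/"),
            new ParseText("some extracted text"));
        writer.close();
      }
    }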

However, "large" usually means record counts on the order of millions. 
Let's do a quick calculation - assuming the keys here are URLs, each key 
takes ~50 bytes on average. We load every 128th key plus its offset 
as a long (8 bytes). This means that for 1 mln keys the memory 
consumption due to the MapFile index will be ~5MB.
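
Spelled out (same assumptions as above; a back-of-the-envelope figure, 
not a measurement):

    1,000,000 keys / 128 keys per index entry ≈ 7,813 index entries
    7,813 entries * (~50-byte key + 8-byte offset) ≈ 450 KB of raw data

The in-memory footprint is several times larger than the raw data, 
since each key is held as a separate Java object with its own 
overhead - hence an estimate in the single-digit-MB range.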

This in turn means that below a certain size (and this threshold is on 
the order of a few million records or so) it's better to use a single 
segment instead of multiple segments with the same total number of records.
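
If your crawl has already produced many small segments, they can be 
consolidated with Nutch's segment merge tool, e.g. (paths here are 
hypothetical):

    bin/nutch mergesegs crawl/segments_merged -dir crawl/segments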


> If we have more segments with fewer pages per segment, it seems we need to open more segment files, and hence use more memory? And if a segment holds more pages, we might need more time to get a particular page out? Finding a page among 10,000 pages should be faster than among 100,000 pages?
> For a search engine that has about 10M documents, how many segment directories should I have?

Perhaps around 10. Segments larger than 1 mln documents are somewhat 
inconvenient to process - fetching takes a long time, and if something 
goes wrong then you lose a large chunk of data.

You can also split your segments along different criteria, e.g. one 
segment per day, or per week.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com