Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/21 14:14:43 UTC

IndexSorter optimizer

Hi,

I'm happy to report that further tests performed on a larger index seem 
to show that the overall impact of the IndexSorter is definitely 
positive: performance improvements are significant, and the overall 
quality of results seems at least comparable, if not actually better.

The reason why result quality seems better is quite interesting, and it 
shows that the simple top-N measures that I was using in my benchmarks 
may have been too simplistic.

Using the original index, it was possible for pages with high tf/idf of 
a term, but with a low "boost" value (the OPIC score), to outrank pages 
with high "boost" but lower tf/idf of a term. This phenomenon leads 
quite often to results that are perceived as "junk", e.g. pages with a 
lot of repeated terms, but with little other real content, like for 
example navigation bars.

Using the optimized index, I reported previously that some of the 
top-scoring results were missing. As it happens, the missing results 
were typically the "junk" pages with high tf/idf but low "boost". Since 
we collect up to N hits, going from higher to lower "boost" values, the 
"junk" pages with low "boost" value were automatically eliminated. So, 
overall the subjective quality of results was improved. On the other 
hand, some of the legitimate results with decent "boost" values were 
also skipped because they didn't fit within the fixed number of hits... 
ah, well. Perhaps we should limit the number of hits in LimitedCollector 
using a cutoff "boost" value, and not the maximum number of hits (or 
maybe both?).
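
Something along these lines, perhaps - a minimal sketch only, assuming 
a TopDocCollector-style parent such as the one LimitedCollector 
extends, and a hypothetical per-document boost array (all names here 
are illustrative, not committed code):

import org.apache.lucene.search.TopDocCollector;

public class BoostCutoffCollector extends TopDocCollector {

  // thrown to abort the search once a limit is reached
  public static class LimitExceeded extends RuntimeException {}

  private final int maxHits;      // hard cap on collected hits
  private final float minBoost;   // cutoff "boost" value
  private final float[] boosts;   // per-doc boost, in descending order

  public BoostCutoffCollector(int numHits, int maxHits,
                              float minBoost, float[] boosts) {
    super(numHits);
    this.maxHits = maxHits;
    this.minBoost = minBoost;
    this.boosts = boosts;
  }

  public void collect(int doc, float score) {
    // the index is boost-sorted, so once we fall below the cutoff
    // no later document can be above it - stop collecting early
    if (getTotalHits() >= maxHits || boosts[doc] < minBoost)
      throw new LimitExceeded();
    super.collect(doc, score);
  }
}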

This again brings to attention the importance of the OPIC score: it 
represents a query-independent opinion about the quality of the page - 
whichever way you calculate it. If you use PageRank, it (allegedly) 
corresponds to other people's opinions about the page, thus providing an 
"objective" quality opinion. If you use a simple list of 
white/black-listed sites that you like/dislike, then it represents your 
own subjective opinion on the quality of the site; etc, etc... In this 
way, running a search engine that provides "good" results is not just a 
matter of precision, recall, tf/idf and other tangible measures; it's 
also a sort of political statement by the engine's operator. ;-)

To conclude, I will add the IndexSorter.java to the core classes, and I 
suggest we continue the experiments ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexSorter optimizer

Posted by Byron Miller <by...@yahoo.com>.
I've got a 400-million-page db I can run this against over the
next few days.

-byron

--- Stefan Groschupf <sg...@media-style.com> wrote:

> Hi Andrzej,
> 
> wow, this is really great news!
> 
> > Using the optimized index, I reported previously that some of the
> > top-scoring results were missing. As it happens, the missing
> > results were typically the "junk" pages with high tf/idf but low
> > "boost". Since we collect up to N hits, going from higher to lower
> > "boost" values, the "junk" pages with low "boost" value were
> > automatically eliminated. So, overall the subjective quality of
> > results was improved. On the other hand, some of the legitimate
> > results with decent "boost" values were also skipped because they
> > didn't fit within the fixed number of hits... ah, well. Perhaps we
> > should limit the number of hits in LimitedCollector using a cutoff
> > "boost" value, and not the maximum number of hits (or maybe both?).
> 
> As far as we experiment, it would be good to have both.
> 
> > To conclude, I will add the IndexSorter.java to the core classes,
> > and I suggest we continue the experiments ...
> 
> Maybe someone out there in the community has a commercial search
> engine running (e.g. a Google appliance or similar), so we could set
> up a Nutch with the same pages and compare the results.
> I guess it will be difficult to compare Nutch with Yahoo or Google,
> since none of us has a 4-billion-page index up and running. I would
> run one on my laptop, but I won't have the bandwidth to fetch for the
> next two days. :-D
> Great work!
> 
> Cheers,
> Stefan


Re: IndexSorter optimizer

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Andrzej,

wow, this is really great news!

> Using the optimized index, I reported previously that some of the
> top-scoring results were missing. As it happens, the missing
> results were typically the "junk" pages with high tf/idf but low
> "boost". Since we collect up to N hits, going from higher to lower
> "boost" values, the "junk" pages with low "boost" value were
> automatically eliminated. So, overall the subjective quality of
> results was improved. On the other hand, some of the legitimate
> results with decent "boost" values were also skipped because they
> didn't fit within the fixed number of hits... ah, well. Perhaps we
> should limit the number of hits in LimitedCollector using a cutoff
> "boost" value, and not the maximum number of hits (or maybe both?).

As far as we experiment, it would be good to have both.

> To conclude, I will add the IndexSorter.java to the core classes,
> and I suggest we continue the experiments ...

Maybe someone out there in the community has a commercial search engine 
running (e.g. a Google appliance or similar), so we could set up a 
Nutch with the same pages and compare the results.
I guess it will be difficult to compare Nutch with Yahoo or Google, 
since none of us has a 4-billion-page index up and running. I would run 
one on my laptop, but I won't have the bandwidth to fetch for the next 
two days. :-D
Great work!

Cheers,
Stefan 

Re: IndexSorter optimizer

Posted by Byron Miller <by...@yahoo.com>.
Great reading and great ideas.

In such a system, where you have say 3 segment partitions, is it 
possible to build a MapReduce job to efficiently fetch, retrieve and 
update these segments?

Use a map job to mark a segment for deletion, and somehow process that 
segment to create a new fetchlist from it, so that you only fetch data 
that isn't already deduped - because it was already fetched, or is 
duplicated in another segment that hasn't aged out.

Just brainstorming :)



--- Andrzej Bialecki <ab...@getopt.org> wrote:

> Doug Cutting wrote:
> 
> > Byron Miller wrote:
> >
> >> On optimizing performance, does anyone know if google
> >> is exporting its entire dataset as an index or only
> >> somehow indexing the topN % (since they only show the
> >> first 1000 or so results anyway)
> >
> > Both.  The highest-scoring pages are kept in separate indexes that
> > are searched first.  When a query fails to match 1000 or so
> > documents in the high-scoring indexes then the entire dataset is
> > searched.  In general there can be multiple levels, e.g.:
> > high-scoring, mid-scoring and low-scoring indexes, with the vast
> > majority of pages in the last category, and the vast majority of
> > queries resolved consulting only the first category.
> >
> > What I have implemented so far for Nutch is a single-index version
> > of this.  The current index-sorting implementation does not yet
> > scale well to indexes larger than ~50M urls.  It is a
> > proof-of-concept.
> >
> > A better long-term approach is to introduce another MapReduce pass
> > that collects Lucene documents (or equivalent) as values, and page
> > scores as keys.  Then the indexing MapReduce pass can partition and
> > sort by score before creating indexes.  The distributed search code
> > will also need to be modified to search high-score indexes first.
> 
> The WWW2005 conference presented a couple of interesting papers on
> the subject (http://www2005.org), among others these:
> 
> 1. http://www2005.org/cdrom/docs/p235.pdf
> 2. http://www2005.org/cdrom/docs/p245.pdf
> 3. http://www2005.org/cdrom/docs/p257.pdf
> 
> The techniques described in the first paper are not too difficult to
> implement, especially Carmel's method of index pruning, which gives
> satisfactory results at moderate cost.
> 
> The third paper, by Long & Suel, presents a concept of using a cache
> of intersections for multi-term queries, which we already sort of use
> with CachingFilters, only they propose to store them on disk instead
> of limiting the cache to a relatively small number of filters kept in
> RAM...
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com


Re: IndexSorter optimizer

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Byron Miller wrote:
>
>> On optimizing performance, does anyone know if google
>> is exporting its entire dataset as an index or only
>> somehow indexing the topN % (since they only show the
>> first 1000 or so results anyway)
>
>
> Both.  The highest-scoring pages are kept in separate indexes that are 
> searched first.  When a query fails to match 1000 or so documents in 
> the high-scoring indexes then the entire dataset is searched.  In 
> general there can be multiple levels, e.g.: high-scoring, mid-scoring 
> and low-scoring indexes, with the vast majority of pages in the last 
> category, and the vast majority of queries resolved consulting only 
> the first category.
>
> What I have implemented so far for Nutch is a single-index version of 
> this.  The current index-sorting implementation does not yet scale 
> well to indexes larger than ~50M urls.  It is a proof-of-concept.
>
> A better long-term approach is to introduce another MapReduce pass 
> that collects Lucene documents (or equivalent) as values, and page 
> scores as keys.  Then the indexing MapReduce pass can partition and 
> sort by score before creating indexes.  The distributed search code 
> will also need to be modified to search high-score indexes first.


The WWW2005 conference presented a couple of interesting papers on the 
subject (http://www2005.org), among others these:

1. http://www2005.org/cdrom/docs/p235.pdf
2. http://www2005.org/cdrom/docs/p245.pdf
3. http://www2005.org/cdrom/docs/p257.pdf

The techniques described in the first paper are not too difficult to 
implement, especially Carmel's method of index pruning, which gives 
satisfactory results at moderate cost.

The third paper, by Long & Suel, presents a concept of using a cache of 
intersections for multi-term queries, which we already sort of use with 
CachingFilters, only they propose to store them on disk instead of 
limiting the cache to a relatively small number of filters kept in RAM...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexSorter optimizer

Posted by Doug Cutting <cu...@nutch.org>.
Byron Miller wrote:
> On optimizing performance, does anyone know if google
> is exporting its entire dataset as an index or only
> somehow indexing the topN % (since they only show the
> first 1000 or so results anyway)

Both.  The highest-scoring pages are kept in separate indexes that are 
searched first.  When a query fails to match 1000 or so documents in the 
high-scoring indexes then the entire dataset is searched.  In general 
there can be multiple levels, e.g.: high-scoring, mid-scoring and 
low-scoring indexes, with the vast majority of pages in the last 
category, and the vast majority of queries resolved consulting only the 
first category.
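
Schematically, the lookup might go like this (just a sketch; the two 
searcher handles and the 1000-hit threshold are illustrative):

import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;

public class TieredSearch {
  // search the high-scoring index first; fall back to the full
  // dataset only when too few documents match
  public static TopDocs search(Searcher topTier, Searcher full,
                               Query query) throws IOException {
    TopDocs hits = topTier.search(query, null, 1000);
    if (hits.totalHits < 1000) {
      hits = full.search(query, null, 1000);  // rare, expensive path
    }
    return hits;
  }
}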

What I have implemented so far for Nutch is a single-index version of 
this.  The current index-sorting implementation does not yet scale well 
to indexes larger than ~50M urls.  It is a proof-of-concept.

A better long-term approach is to introduce another MapReduce pass that 
collects Lucene documents (or equivalent) as values, and page scores as 
keys.  Then the indexing MapReduce pass can partition and sort by score 
before creating indexes.  The distributed search code will also need to 
be modified to search high-score indexes first.
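
As a rough sketch of that extra pass (written against the mapred 
interfaces in the current Nutch tree; ScoreKey and DocWritable are 
hypothetical stand-ins for a float-score key and a serialized Lucene 
document):

import java.io.IOException;
import org.apache.nutch.io.Writable;
import org.apache.nutch.io.WritableComparable;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.mapred.Mapper;
import org.apache.nutch.mapred.OutputCollector;
import org.apache.nutch.mapred.Reporter;

public class ScoreSortMapper implements Mapper {
  public void configure(JobConf job) {}
  public void close() throws IOException {}

  // invert (url, document) into (score, document) so that the
  // framework partitions and sorts by score before indexing; a
  // descending comparator on ScoreKey puts high scores first
  public void map(WritableComparable url, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    DocWritable doc = (DocWritable)value;
    output.collect(new ScoreKey(doc.getScore()), doc);
  }
}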

Doug

Re: IndexSorter optimizer

Posted by Byron Miller <by...@yahoo.com>.
On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)

With this patch, and a top-result limit set in the XML config file, 
does that mean it will stop scanning the index at that point?  Is there 
a methodology to actually prune the index on some scaling factor, so 
that a 4-billion-page index can be searchable only 1k results deep on 
average?

Seems like some sort of method to do the above would cut your search 
processing/index size down fairly well. But it may be more expensive to 
post-process at this scale than it is to simply push everything and let 
the query optimizer ignore it as needed... after all, disk space is 
getting rather cheap compared to CPU processing & memory.



--- Doug Cutting <cu...@nutch.org> wrote:

> Andrzej Bialecki wrote:
> > I'm happy to report that further tests performed on a larger index
> > seem to show that the overall impact of the IndexSorter is
> > definitely positive: performance improvements are significant, and
> > the overall quality of results seems at least comparable, if not
> > actually better.
> 
> Great news!
> 
> I will submit the Lucene patches ASAP, now that we know they're useful.
> 
> Doug


Re: IndexSorter optimizer

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> I'm happy to report that further tests performed on a larger index seem 
> to show that the overall impact of the IndexSorter is definitely 
> positive: performance improvements are significant, and the overall 
> quality of results seems at least comparable, if not actually better.

Great news!

I will submit the Lucene patches ASAP, now that we know they're useful.

Doug

Re: IndexSorter optimizer

Posted by Andrzej Bialecki <ab...@getopt.org>.
American Jeff Bowden wrote:

> Andrzej Bialecki wrote:
>
>> Hi,
>>
>> I'm happy to report that further tests performed on a larger index 
>> seem to show that the overall impact of the IndexSorter is definitely 
>> positive: performance improvements are significant, and the overall 
>> quality of results seems at least comparable, if not actually better.
>
>
>
> This is very interesting.  What's the computational complexity and 
> disk I/O for index sorting as compared to other operations on an index 
> (e.g. adding/deleting N documents and running optimize)?


Comparable to optimize(). All index data needs to be read and copied, so 
the whole process is I/O bound.
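
In other words, it is essentially a full rewrite of the index, roughly 
like this sketch (SortingReader stands for the doc-id-permuting reader 
inside IndexSorter; the paths and the permutation helper are 
illustrative only):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class SortIndexSketch {
  public static void main(String[] args) throws Exception {
    IndexReader input = IndexReader.open("index");
    // hypothetical permutation of doc ids by descending boost
    int[] newToOld = computeOrder(input);
    IndexReader sorted = new SortingReader(input, newToOld);
    IndexWriter writer =
      new IndexWriter("index-sorted", new StandardAnalyzer(), true);
    // every posting and stored field is read and rewritten - a full
    // copy of the index, hence the optimize()-like I/O cost
    writer.addIndexes(new IndexReader[] { sorted });
    writer.close();
  }
}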

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexSorter optimizer

Posted by American Jeff Bowden <jl...@houseofdistraction.com>.
Andrzej Bialecki wrote:

> Hi,
>
> I'm happy to report that further tests performed on a larger index 
> seem to show that the overall impact of the IndexSorter is definitely 
> positive: performance improvements are significant, and the overall 
> quality of results seems at least comparable, if not actually better.


This is very interesting.  What's the computational complexity and disk 
I/O for index sorting as compared to other operations on an index (e.g. 
adding/deleting N documents and running optimize)?


Re: IndexSorter optimizer

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Right, I confused two bugs from different files - the other bug still 
> exists in the committed version of the 
> LuceneQueryOptimizer.LimitedCollector constructor: instead of 
> super(maxHits) it should be super(numHits). This was actually the bug 
> that was causing that mysterious slowdown for higher values of MAX_HITS.

Okay.  I think I've finally fixed this correctly.

Doug

Re: IndexSorter optimizer

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> I have committed this, along with the LuceneQueryOptimizer changes.
>
> I could only find one place where I was using numDocs() instead of 
> maxDoc().


Right, I confused two bugs from different files - the other bug still 
exists in the committed version of the 
LuceneQueryOptimizer.LimitedCollector constructor: instead of 
super(maxHits) it should be super(numHits). This was actually the bug 
that was causing that mysterious slowdown for higher values of MAX_HITS.
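
For reference, the fix in the constructor (sketched from memory, not a 
verbatim diff):

public LimitedCollector(int numHits, int maxHits) {
  super(numHits);   // was: super(maxHits) - it sized the priority
                    // queue to MAX_HITS, hence the growing slowdown
  this.maxHits = maxHits;
}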

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexSorter optimizer

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
>> Sounds like tf/idf might be de-emphasized in scoring.  Perhaps 
>> NutchSimilarity.tf() should use log() instead of sqrt() when 
>> field==content?
> 
> I don't think it's that simple; the OPIC score is what determined this 
> behaviour, and it doesn't correspond to tf/idf at all, but to a human 
> judgement.

If we think that high-OPIC is more valuable than high-content-tf, then 
we should use different functions to damp these.  Currently both are 
damped with sqrt().

>> I've updated the version of Lucene included with Nutch to have the 
>> required patch.  Would you like me to commit IndexSorter.java or would 
>> you?
> 
> Please do it. There are two typos in your version of IndexSorter: you 
> used numDocs() in two places instead of maxDoc(), which for indexes with 
> deleted docs (after dedup) leads to exceptions.

I have committed this, along with the LuceneQueryOptimizer changes.

I could only find one place where I was using numDocs() instead of maxDoc().

Cheers,

Doug

Re: IndexSorter optimizer

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> Using the original index, it was possible for pages with high tf/idf 
>> of a term, but with a low "boost" value (the OPIC score), to outrank 
>> pages with high "boost" but lower tf/idf of a term. This phenomenon 
>> leads quite often to results that are perceived as "junk", e.g. pages 
>> with a lot of repeated terms, but with little other real content, 
>> like for example navigation bars.
>
>
> Sounds like tf/idf might be de-emphasized in scoring.  Perhaps 
> NutchSimilarity.tf() should use log() instead of sqrt() when 
> field==content?


I don't think it's that simple; the OPIC score is what determined this 
behaviour, and it doesn't correspond to tf/idf at all, but to a human 
judgement.

>
>> To conclude, I will add the IndexSorter.java to the core classes, and 
>> I suggest we continue the experiments ...
>
>
> I've updated the version of Lucene included with Nutch to have the 
> required patch.  Would you like me to commit IndexSorter.java or would 
> you?


Please do it. There are two typos in your version of IndexSorter: you 
used numDocs() in two places instead of maxDoc(), which for indexes with 
deleted docs (after dedup) leads to exceptions.
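
The distinction matters whenever an array is indexed by raw document 
id - a hypothetical example:

// numDocs() excludes deleted documents; maxDoc() is one greater than
// the largest doc id in use. Arrays indexed by doc id must therefore
// be sized with maxDoc(), or doc ids past numDocs() will overflow
// them once dedup has deleted something:
float[] boostByDoc = new float[reader.maxDoc()];   // not numDocs()

// iterate over all doc ids, skipping the holes left by dedup:
for (int d = 0; d < reader.maxDoc(); d++) {
  if (reader.isDeleted(d)) continue;
  // ... per-document work ...
}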

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexSorter optimizer

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Using the original index, it was possible for pages with high tf/idf of 
> a term, but with a low "boost" value (the OPIC score), to outrank pages 
> with high "boost" but lower tf/idf of a term. This phenomenon leads 
> quite often to results that are perceived as "junk", e.g. pages with a 
> lot of repeated terms, but with little other real content, like for 
> example navigation bars.

Sounds like tf/idf might be de-emphasized in scoring.  Perhaps 
NutchSimilarity.tf() should use log() instead of sqrt() when field==content?
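
To illustrate the difference between the two damping curves (not the 
actual NutchSimilarity code; 1+log() is just one common form):

public class TfDamping {
  static float sqrtTf(float freq) { return (float)Math.sqrt(freq); }
  static float logTf(float freq)  { return (float)(1.0 + Math.log(freq)); }

  public static void main(String[] args) {
    // a page repeating a term 100 times gains 10x under sqrt,
    // but only ~5.6x under 1+log - spammy repetition pays off less
    int[] freqs = { 1, 10, 100, 1000 };
    for (int i = 0; i < freqs.length; i++) {
      System.out.println(freqs[i] + "\t" + sqrtTf(freqs[i])
                         + "\t" + logTf(freqs[i]));
    }
  }
}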

> To conclude, I will add the IndexSorter.java to the core classes, and I 
> suggest we continue the experiments ...

I've updated the version of Lucene included with Nutch to have the 
required patch.  Would you like me to commit IndexSorter.java or would you?

Doug