Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/21 14:14:43 UTC
IndexSorter optimizer
Hi,
I'm happy to report that further tests performed on a larger index seem
to show that the overall impact of the IndexSorter is definitely
positive: performance improvements are significant, and the overall
quality of results seems at least comparable, if not actually better.
The reason why result quality seems better is quite interesting, and it
shows that the simple top-N measures that I was using in my benchmarks
may have been too simplistic.
Using the original index, it was possible for pages with high tf/idf of
a term, but with a low "boost" value (the OPIC score), to outrank pages
with high "boost" but lower tf/idf of a term. This phenomenon leads
quite often to results that are perceived as "junk", e.g. pages with a
lot of repeated terms, but with little other real content, like for
example navigation bars.
Using the optimized index, I reported previously that some of the
top-scoring results were missing. As it happens, the missing results
were typically the "junk" pages with high tf/idf but low "boost". Since
we collect up to N hits, going from higher to lower "boost" values, the
"junk" pages with low "boost" value were automatically eliminated. So,
overall the subjective quality of results was improved. On the other
hand, some of the legitimate results with decent "boost" values were
also skipped because they didn't fit within the fixed number of hits...
ah, well. Perhaps we should limit the number of hits in LimitedCollector
using a cutoff "boost" value, and not the maximum number of hits (or
maybe both?).
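[Editor's note: the cutoff idea above can be sketched in a few lines of plain Java. This is a hypothetical illustration, not the actual LimitedCollector code, and it assumes documents arrive in descending "boost" order, as they do in a sorted index.]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of collecting over a boost-sorted index: documents arrive in
// descending "boost" order, so collection can stop early either after
// maxHits documents, or once boost drops below a cutoff -- or both.
class CutoffCollector {
    static List<Integer> collect(int[] docs, float[] boosts,
                                 int maxHits, float boostCutoff) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            if (boosts[i] < boostCutoff) break; // cutoff criterion
            hits.add(docs[i]);
            if (hits.size() >= maxHits) break;  // max-hits criterion
        }
        return hits;
    }
}
```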
This again brings to attention the importance of the OPIC score: it
represents a query-independent opinion about the quality of the page -
whichever way you calculate it. If you use PageRank, it (allegedly)
corresponds to other people's opinions about the page, thus providing an
"objective" quality opinion. If you use a simple list of
white/black-listed sites that you like/dislike, then it represents your
own subjective opinion on the quality of the site; etc, etc... In this
way, running a search engine that provides "good" results is not just a
plain precision, recall, tf/idf and other tangible measures, it's also a
sort of political statement of the engine's operator. ;-)
To conclude, I will add the IndexSorter.java to the core classes, and I
suggest to continue the experiments ...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: IndexSorter optimizer
Posted by Byron Miller <by...@yahoo.com>.
I've got a 400-million db I can run this against over the
next few days.
-byron
--- Stefan Groschupf <sg...@media-style.com> wrote:
> Hi Andrzej,
>
> wow, that's really great news!
>
> > Using the optimized index, I reported previously that some of the
> > top-scoring results were missing. As it happens, the missing
> > results were typically the "junk" pages with high tf/idf but low
> > "boost". Since we collect up to N hits, going from higher to lower
> > "boost" values, the "junk" pages with low "boost" value were
> > automatically eliminated. So, overall the subjective quality of
> > results was improved. On the other hand, some of the legitimate
> > results with decent "boost" values were also skipped because they
> > didn't fit within the fixed number of hits... ah, well. Perhaps we
> > should limit the number of hits in LimitedCollector using a cutoff
> > "boost" value, and not the maximum number of hits (or maybe both?).
>
> As far as we're experimenting, it would be good to have both.
>
> > To conclude, I will add the IndexSorter.java to the core classes,
> > and I suggest to continue the experiments ...
>
> Maybe someone out there in the community has a commercial search
> engine running (e.g. a Google appliance or similar), so we could
> set up a Nutch instance with the same pages and compare the results.
> I guess it will be difficult to compare Nutch with Yahoo or Google,
> since none of us has a 4-billion-page index up and running. I would
> run one on my laptop, but I won't have the bandwidth to fetch until
> the next two days. :-D
>
> Great work!
>
> Cheers,
> Stefan
Re: IndexSorter optimizer
Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Andrzej,
wow, that's really great news!
> Using the optimized index, I reported previously that some of the
> top-scoring results were missing. As it happens, the missing
> results were typically the "junk" pages with high tf/idf but low
> "boost". Since we collect up to N hits, going from higher to lower
> "boost" values, the "junk" pages with low "boost" value were
> automatically eliminated. So, overall the subjective quality of
> results was improved. On the other hand, some of the legitimate
> results with a decent "boost" values were also skipped because they
> didn't fit within the fixed number of hits... ah, well. Perhaps we
> should limit the number of hits in LimitedCollector using a cutoff
> "boost" value, and not the maximum number of hits (or maybe both?).
As far as we're experimenting, it would be good to have both.
> To conclude, I will add the IndexSorter.java to the core classes,
> and I suggest to continue the experiments ...
Maybe someone out there in the community has a commercial search engine
running (e.g. a Google appliance or similar), so we could set up a
Nutch instance with the same pages and compare the results.
I guess it will be difficult to compare Nutch with Yahoo or Google,
since none of us has a 4-billion-page index up and running. I would run
one on my laptop, but I won't have the bandwidth to fetch until the
next two days. :-D
Great work!
Cheers,
Stefan
Re: IndexSorter optimizer
Posted by Byron Miller <by...@yahoo.com>.
Great reading and great ideas.
In such a system, where you have say 3 segment partitions, is it
possible to build a MapReduce job to efficiently fetch, retrieve and
update these segments? Use a map job to process a segment for deletion,
and somehow process that segment to create a new fetchlist, so that you
only fetch data that isn't already deduped because it was already
fetched or duplicated in another non-aged-out segment.
Just brainstorming :)
--- Andrzej Bialecki <ab...@getopt.org> wrote:
> Doug Cutting wrote:
>
> > Byron Miller wrote:
> >
> >> On optimizing performance, does anyone know if google
> >> is exporting its entire dataset as an index or only
> >> somehow indexing the topN % (since they only show the
> >> first 1000 or so results anyway)
> >
> > Both. The highest-scoring pages are kept in separate indexes that
> > are searched first. When a query fails to match 1000 or so
> > documents in the high-scoring indexes then the entire dataset is
> > searched. In general there can be multiple levels, e.g.:
> > high-scoring, mid-scoring and low-scoring indexes, with the vast
> > majority of pages in the last category, and the vast majority of
> > queries resolved consulting only the first category.
> >
> > What I have implemented so far for Nutch is a single-index version
> > of this. The current index-sorting implementation does not yet
> > scale well to indexes larger than ~50M urls. It is a
> > proof-of-concept.
> >
> > A better long-term approach is to introduce another MapReduce pass
> > that collects Lucene documents (or equivalent) as values, and page
> > scores as keys. Then the indexing MapReduce pass can partition and
> > sort by score before creating indexes. The distributed search code
> > will also need to be modified to search high-score indexes first.
>
> The WWW2005 conference presented a couple of interesting papers on
> the subject (http://www2005.org), among others these:
>
> 1. http://www2005.org/cdrom/docs/p235.pdf
> 2. http://www2005.org/cdrom/docs/p245.pdf
> 3. http://www2005.org/cdrom/docs/p257.pdf
>
> The techniques described in the first paper are not too difficult to
> implement, especially Carmel's method of index pruning, which gives
> satisfactory results at moderate cost.
>
> The third paper, by Long & Suel, presents a concept of using a cache
> of intersections for multi-term queries, which we already sort of use
> with CachingFilters, only they propose to store them on-disk instead
> of limiting the cache to a relatively small number of filters kept in
> RAM...
>
> --
> Best regards,
> Andrzej Bialecki <><
Re: IndexSorter optimizer
Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Byron Miller wrote:
>
>> On optimizing performance, does anyone know if google
>> is exporting its entire dataset as an index or only
>> somehow indexing the topN % (since they only show the
>> first 1000 or so results anyway)
>
>
> Both. The highest-scoring pages are kept in separate indexes that are
> searched first. When a query fails to match 1000 or so documents in
> the high-scoring indexes then the entire dataset is searched. In
> general there can be multiple levels, e.g.: high-scoring, mid-scoring
> and low-scoring indexes, with the vast majority of pages in the last
> category, and the vast majority of queries resolved consulting only
> the first category.
>
> What I have implemented so far for Nutch is a single-index version of
> this. The current index-sorting implementation does not yet scale
> well to indexes larger than ~50M urls. It is a proof-of-concept.
>
> A better long-term approach is to introduce another MapReduce pass
> that collects Lucene documents (or equivalent) as values, and page
> scores as keys. Then the indexing MapReduce pass can partition and
> sort by score before creating indexes. The distributed search code
> will also need to be modified to search high-score indexes first.
The WWW2005 conference presented a couple of interesting papers on the
subject (http://www2005.org), among others these:
1. http://www2005.org/cdrom/docs/p235.pdf
2. http://www2005.org/cdrom/docs/p245.pdf
3. http://www2005.org/cdrom/docs/p257.pdf
The techniques described in the first paper are not too difficult to
implement, especially Carmel's method of index pruning, which gives
satisfactory results at moderate cost.
The third paper, by Long & Suel, presents a concept of using a cache of
intersections for multi-term queries, which we already sort of use with
CachingFilters, only they propose to store them on-disk instead of
limiting the cache to relatively small number of filters kept in RAM...
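[Editor's note: Carmel-style static pruning, in its simplest top-k form, can be sketched as below. This is a hypothetical illustration of the general idea from the paper, not Nutch code; the class and method names are made up.]

```java
import java.util.Arrays;

// Simplest form of Carmel-style static index pruning: for each term,
// keep only the k postings with the highest per-document scores and
// drop the tail.  The pruned index can answer most top-k queries like
// the full index does, at a fraction of the size.
class TopKPruning {
    static float[] prune(float[] postingScores, int k) {
        float[] sorted = postingScores.clone();
        Arrays.sort(sorted);                    // ascending order
        int keep = Math.min(k, sorted.length);
        // keep the `keep` largest scores (the top of the sorted array)
        return Arrays.copyOfRange(sorted, sorted.length - keep, sorted.length);
    }
}
```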
--
Best regards,
Andrzej Bialecki <><
Re: IndexSorter optimizer
Posted by Doug Cutting <cu...@nutch.org>.
Byron Miller wrote:
> On optimizing performance, does anyone know if google
> is exporting its entire dataset as an index or only
> somehow indexing the topN % (since they only show the
> first 1000 or so results anyway)
Both. The highest-scoring pages are kept in separate indexes that are
searched first. When a query fails to match 1000 or so documents in the
high-scoring indexes then the entire dataset is searched. In general
there can be multiple levels, e.g.: high-scoring, mid-scoring and
low-scoring indexes, with the vast majority of pages in the last
category, and the vast majority of queries resolved consulting only the
first category.
What I have implemented so far for Nutch is a single-index version of
this. The current index-sorting implementation does not yet scale well
to indexes larger than ~50M urls. It is a proof-of-concept.
A better long-term approach is to introduce another MapReduce pass that
collects Lucene documents (or equivalent) as values, and page scores as
keys. Then the indexing MapReduce pass can partition and sort by score
before creating indexes. The distributed search code will also need to
be modified to search high-score indexes first.
Doug
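[Editor's note: the multi-level scheme Doug describes can be modeled roughly as below. This is a toy sketch, not Nutch or Google code; each "tier" here is just a list of document ids standing in for a separate index.]

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of tiered searching: query the small high-scoring index
// first, and only descend to the (much larger) lower tiers when the
// tiers searched so far yield fewer than `threshold` matches.
class TieredSearch {
    static List<Integer> search(List<List<Integer>> tiers, int threshold) {
        List<Integer> results = new ArrayList<>();
        for (List<Integer> tier : tiers) {
            results.addAll(tier);            // "search" this tier
            if (results.size() >= threshold) // enough hits: stop descending
                break;
        }
        return results;
    }
}
```

Most queries would be answered from the first tier alone, which is why the vast majority of pages can sit in the last, rarely consulted tier.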
Re: IndexSorter optimizer
Posted by Byron Miller <by...@yahoo.com>.
On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)
With this patch and a top result set in the xml file,
does that mean it will stop scanning the index at that
point? Is there a methodology to actually prune the
index on some scaling factor, so that a 4-billion-page
index can be searched only 1k results deep on average?
It seems like some such method would cut your search
processing and index size down fairly well. But it may
be more expensive to post-process at this scale than it
is to simply push everything and let the query optimizer
ignore it as needed... after all, disk space is getting
rather cheap compared to CPU processing & memory.
--- Doug Cutting <cu...@nutch.org> wrote:
> Andrzej Bialecki wrote:
> > I'm happy to report that further tests performed on a larger index
> > seem to show that the overall impact of the IndexSorter is
> > definitely positive: performance improvements are significant, and
> > the overall quality of results seems at least comparable, if not
> > actually better.
>
> Great news!
>
> I will submit the Lucene patches ASAP, now that we know they're
> useful.
>
> Doug
Re: IndexSorter optimizer
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> I'm happy to report that further tests performed on a larger index seem
> to show that the overall impact of the IndexSorter is definitely
> positive: performance improvements are significant, and the overall
> quality of results seems at least comparable, if not actually better.
Great news!
I will submit the Lucene patches ASAP, now that we know they're useful.
Doug
Re: IndexSorter optimizer
Posted by Andrzej Bialecki <ab...@getopt.org>.
American Jeff Bowden wrote:
> Andrzej Bialecki wrote:
>
>> Hi,
>>
>> I'm happy to report that further tests performed on a larger index
>> seem to show that the overall impact of the IndexSorter is definitely
>> positive: performance improvements are significant, and the overall
>> quality of results seems at least comparable, if not actually better.
>
>
>
> This is very interesting. What's the computational complexity and
> disk I/O for index sorting as compared to other operations on an index
> (e.g. adding/deleting N documents and running optimize)?
Comparable to optimize(). All index data needs to be read and copied, so
the whole process is I/O bound.
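[Editor's note: the core of the operation is a permutation -- compute a new document order by descending boost, then copy every posting and stored document to its new position in one pass, which is why the cost is comparable to optimize(). A stripped-down model of just the ordering step (hypothetical; the real IndexSorter works on Lucene's index files):]

```java
import java.util.Arrays;
import java.util.Comparator;

// Model of the sorting step: produce the new document order (old doc
// ids, best boost first).  The actual IndexSorter then rewrites the
// postings and stored fields in this order, reading and copying all
// index data once -- hence the process is I/O bound, like optimize().
class SortModel {
    static Integer[] order(float[] boosts) {
        Integer[] docs = new Integer[boosts.length];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        // sort old doc ids by descending boost
        Arrays.sort(docs, Comparator.comparingDouble((Integer d) -> -boosts[d]));
        return docs;
    }
}
```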
--
Best regards,
Andrzej Bialecki <><
Re: IndexSorter optimizer
Posted by American Jeff Bowden <jl...@houseofdistraction.com>.
Andrzej Bialecki wrote:
> Hi,
>
> I'm happy to report that further tests performed on a larger index
> seem to show that the overall impact of the IndexSorter is definitely
> positive: performance improvements are significant, and the overall
> quality of results seems at least comparable, if not actually better.
This is very interesting. What's the computational complexity and disk
I/O for index sorting as compared to other operations on an index (e.g.
adding/deleting N documents and running optimize)?
Re: IndexSorter optimizer
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Right, I confused two bugs from different files - the other bug still
> exists in the committed version of the
> LuceneQueryOptimizer.LimitedCollector constructor, instead of
> super(maxHits) it should be super(numHits) - this was actually the bug,
> which was causing that mysterious slowdown for higher values of MAX_HITS.
Okay. I think I've finally fixed this correctly.
Doug
Re: IndexSorter optimizer
Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> I have committed this, along with the LuceneQueryOptimizer changes.
>
> I could only find one place where I was using numDocs() instead of
> maxDoc().
Right, I confused two bugs from different files - the other bug still
exists in the committed version of the
LuceneQueryOptimizer.LimitedCollector constructor, instead of
super(maxHits) it should be super(numHits) - this was actually the bug,
which was causing that mysterious slowdown for higher values of MAX_HITS.
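[Editor's note: to illustrate why that one-word fix matters, here is a simplified model, not the actual class. LimitedCollector keeps a priority queue of the best numHits hits while scanning at most maxHits documents; passing maxHits to the superclass sizes the queue by the scan limit, so a large MAX_HITS setting makes every insertion more expensive.]

```java
import java.util.PriorityQueue;

// Simplified model of LimitedCollector: scan at most maxHits docs, but
// keep only the best numHits scores.  The queue must be bounded by
// numHits -- bounding it by maxHits (the original bug) keeps up to
// maxHits entries alive, slowing collection as MAX_HITS grows.
class Limited {
    static PriorityQueue<Float> collect(float[] scores, int numHits, int maxHits) {
        PriorityQueue<Float> best = new PriorityQueue<>(numHits); // min-heap
        int scanned = 0;
        for (float s : scores) {
            if (scanned++ >= maxHits) break;        // hit the scan limit
            best.add(s);
            if (best.size() > numHits) best.poll(); // evict current worst
        }
        return best;
    }
}
```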
--
Best regards,
Andrzej Bialecki <><
Re: IndexSorter optimizer
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
>> Sounds like tf/idf might be de-emphasized in scoring. Perhaps
>> NutchSimilarity.tf() should use log() instead of sqrt() when
>> field==content?
>
> I don't think it's that simple, the OPIC score is what determined this
> behaviour, and it doesn't correspond at all to tf/idf, but to a human
> judgement.
If we think that high-OPIC is more valuable than high-content-tf, then
we should use different functions to damp these. Currently both are
damped with sqrt().
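[Editor's note: the difference between the two damping functions is easy to see numerically. This is plain math for illustration, not the actual NutchSimilarity code: sqrt keeps growing noticeably with term frequency, while 1 + log flattens out, so a page repeating a term hundreds of times gains far less.]

```java
// Compare the two candidate tf damping curves: sqrt(tf) vs. 1 + ln(tf).
// With sqrt, a page repeating a term 100x scores 10x a single
// occurrence; with the log variant it scores only ~5.6x, which
// de-emphasizes keyword-stuffed "junk" pages.
class Damping {
    static double sqrtTf(int tf) { return Math.sqrt(tf); }
    static double logTf(int tf)  { return 1.0 + Math.log(tf); }
}
```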
>> I've updated the version of Lucene included with Nutch to have the
>> required patch. Would you like me to commit IndexSorter.java or would
>> you?
>
> Please do it. There are two typos in your version of IndexSorter, you
> used numDocs() in two places instead of maxDoc(), which for indexes with
> deleted docs (after dedup) leads to exceptions.
I have committed this, along with the LuceneQueryOptimizer changes.
I could only find one place where I was using numDocs() instead of maxDoc().
Cheers,
Doug
Re: IndexSorter optimizer
Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>
>> Using the original index, it was possible for pages with high tf/idf
>> of a term, but with a low "boost" value (the OPIC score), to outrank
>> pages with high "boost" but lower tf/idf of a term. This phenomenon
>> leads quite often to results that are perceived as "junk", e.g. pages
>> with a lot of repeated terms, but with little other real content,
>> like for example navigation bars.
>
>
> Sounds like tf/idf might be de-emphasized in scoring. Perhaps
> NutchSimilarity.tf() should use log() instead of sqrt() when
> field==content?
I don't think it's that simple, the OPIC score is what determined this
behaviour, and it doesn't correspond at all to tf/idf, but to a human
judgement.
>
>> To conclude, I will add the IndexSorter.java to the core classes, and
>> I suggest to continue the experiments ...
>
>
> I've updated the version of Lucene included with Nutch to have the
> required patch. Would you like me to commit IndexSorter.java or would
> you?
Please do it. There are two typos in your version of IndexSorter, you
used numDocs() in two places instead of maxDoc(), which for indexes with
deleted docs (after dedup) leads to exceptions.
--
Best regards,
Andrzej Bialecki <><
Re: IndexSorter optimizer
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Using the original index, it was possible for pages with high tf/idf of
> a term, but with a low "boost" value (the OPIC score), to outrank pages
> with high "boost" but lower tf/idf of a term. This phenomenon leads
> quite often to results that are perceived as "junk", e.g. pages with a
> lot of repeated terms, but with little other real content, like for
> example navigation bars.
Sounds like tf/idf might be de-emphasized in scoring. Perhaps
NutchSimilarity.tf() should use log() instead of sqrt() when field==content?
> To conclude, I will add the IndexSorter.java to the core classes, and I
> suggest to continue the experiments ...
I've updated the version of Lucene included with Nutch to have the
required patch. Would you like me to commit IndexSorter.java or would you?
Doug