Posted to dev@nutch.apache.org by Michael Nebel <mi...@nebel.de> on 2005/08/04 13:53:15 UTC

Re: IndexOptimizer bug?

Hi,

I fixed the problem with the following patch:

--- IndexOptimizer.java 2005-08-04 12:55:54.000000000 +0200
+++ IndexOptimizer.java.~1.6.~  2005-01-21 00:48:50.000000000 +0100
@@ -138,7 +138,7 @@

          if (score > minScore) {
            sdq.put(new ScoreDoc(doc, score));
-          if (sdq.size() >= count) {               // if sdq overfull
+          if (sdq.size() > count) {               // if sdq overfull
              sdq.pop();                            // remove lowest in sdq
              minScore = ((ScoreDoc)sdq.top()).score; // reset minScore
            }
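
To see why the comparison matters: the Lucene PriorityQueue behind sdq is backed by a
fixed-size array, so once the queue already holds count entries the next put() runs off
the end of the array (hence the ArrayIndexOutOfBoundsException at 22697 for a count of
22696). Checking ">= count" right after the put keeps the queue from ever growing past
count. (Judging by the timestamps, the diff above seems to be taken against the old
~1.6~ backup file, so the "-" line is the new code and the "+" line the old one.)
Below is a small self-contained sketch of just that bookkeeping; it is illustrative
only and uses java.util.PriorityQueue as a stand-in for the fixed-capacity Lucene
queue, so it merely reports the maximum size reached instead of actually overflowing:

// Illustrative sketch, not the Nutch source.
import java.util.PriorityQueue;

public class SdqBoundSketch {

    // "count" plays the role of the target postings count, e.g. 22696 in Ferenc's log.
    static int maxQueueSize(float[] scores, int count, boolean checkGreaterOrEqual) {
        PriorityQueue<Float> sdq = new PriorityQueue<>(); // min-heap: peek() is the lowest score
        float minScore = 0.0f;
        int maxSize = 0;
        for (float score : scores) {
            if (score > minScore) {
                sdq.add(score);                            // put(new ScoreDoc(doc, score))
                maxSize = Math.max(maxSize, sdq.size());   // a fixed array of size count overflows past count
                boolean overfull = checkGreaterOrEqual ? sdq.size() >= count : sdq.size() > count;
                if (overfull) {
                    sdq.poll();                            // remove lowest in sdq
                    minScore = sdq.peek();                 // reset minScore
                }
            }
        }
        return maxSize;
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.8f, 0.7f, 0.6f, 0.5f, 0.4f};
        int count = 3;
        // with "> count" the queue momentarily needs count + 1 slots -> overflow in the real class
        System.out.println("check '> count' : max size " + maxQueueSize(scores, count, false));
        // with ">= count" the queue never holds more than count entries
        System.out.println("check '>= count': max size " + maxQueueSize(scores, count, true));
    }
}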

My index shrank from 8.5 GB to 0.5 GB. I found no documentation about
the background of this tool. Can anyone tell me what the idea behind it is?

Regards

	Michael



Andy Liu wrote:

> I believe this tool is unfinished and unsupported.
> 
> On 7/22/05, yoursoft@freemail.hu <yo...@freemail.hu> wrote:
> 
>>I found an IndexOptimizer in Nutch.
>>When I run it, it throws an exception:
>>....
>>Optimizing url:http from 226957 to 22696
>>Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
>>        at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
>>        at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
>>        at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
>>        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
>>        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
>>        at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
>>        at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
>>        at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)
>>


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: IndexOptimizer bug?

Posted by Doug Cutting <cu...@nutch.org>.
Michael Nebel wrote:
> The IndexOptimizer uses a different approach. If I read the code right,
> it takes all terms with an idf below a certain threshold and reduces their
> entries, so the total number of documents a search finds changes. With the
> default configuration only about 10% of the terms stay in the index, so
> the answer to the query "http" gets (much) smaller.
> 
> What I still do not know: yes, a smaller index makes the system much
> faster, but at what price does it come? Which numbers make sense?

IndexOptimizer was part of a never-completed attempt to implement a 
technique somewhat related to what Torsten Suel describes in his 
"Optimized Query Execution in Large Search Engines with Global Page 
Ordering":

http://cis.poly.edu/suel/papers/order.pdf

A majority of search time is spent considering low-scoring documents for 
frequent terms, documents which rarely appear in hit lists.  Suel 
re-sorts document lists in the index by a document score, then simply 
stops searching once a certain number of matches are found.  In theory a 
higher-scoring match could still be found after this point, one with, 
e.g., very large TF values, but in practice this happens rarely.

At this point I don't think it's worth describing how IndexOptimizer fit 
into this in more detail.  Rather it would be better now to simply write 
something that could sort a Lucene index so that document numbers 
increase with some document scoring function.  Or, alternately, to sort 
documents prior to creating the Lucene index.
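
As a toy illustration of that idea (not Nutch or Lucene code; the pages, scores and
class names below are invented): compute a global score per page, assign document
numbers in descending score order before indexing, and let the search walk matching
documents in doc-id order and stop once it has collected enough hits.

// Toy sketch of global page ordering plus early termination; all data is made up.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GlobalOrderSketch {

    static class Page {
        final String url;
        final float globalScore;   // e.g. a link-analysis score computed before indexing
        Page(String url, float globalScore) { this.url = url; this.globalScore = globalScore; }
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>(List.of(
            new Page("http://a.example/", 0.2f),
            new Page("http://b.example/", 0.9f),
            new Page("http://c.example/", 0.5f),
            new Page("http://d.example/", 0.7f)));

        // Index time: sort by global score, so low document numbers mean high-quality pages.
        pages.sort(Comparator.comparingDouble((Page p) -> p.globalScore).reversed());

        // Search time: scan matching documents in doc-id order and stop after N hits.
        int wanted = 2;
        List<String> hits = new ArrayList<>();
        for (int docId = 0; docId < pages.size() && hits.size() < wanted; docId++) {
            // pretend every page matches the query; a real search would consult the postings list
            hits.add(docId + ": " + pages.get(docId).url);
        }
        hits.forEach(System.out::println);
        // Everything skipped after the early stop has a lower global score by construction,
        // which is why stopping early rarely changes the top hits.
    }
}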

Doug

Re: IndexOptimizer bug?

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Hi Michael,

Thanks for the answer. I think you read the source code right: the tool
removes low-frequency pages from the index.
I think that if you want to index a lower number of pages, it is simpler
to fetch a lower number of top pages (generate segments with -TopN).

We don't use the indexmerge tool, because balancing segments between
backends would then be more work.
Once we have fetched the number of pages we want, we run a dedup, then a
prune, and then OptimizeIndex. Every segment has its own index.
We count the pages actually indexed in each index and balance the segments
between the backends. If one backend's CPU load is higher than the others',
we move a segment from it to another one. That way we need neither a
re-index nor the indexmergetool.

Regards,
    Ferenc


Michael Nebel wrote:

> Hi Ferenc,
>
> as far as I understand, your tool removes all deleted pages ("nutch
> prune", "nutch dedup") from an index and builds a new (smaller) one. In
> our workflow we run "nutch prune" on the segment indexes and then do a
> "nutch merge", so the deleted pages do not occur in our main index.
> In our scenario, your tool only helps us to tune the segment indexes;
> with our main index it seems to be of nearly no use... But when changing
> the workflow - first merging, then deleting - OptimizeIndex should be a
> "must do". We have only been lucky to avoid the problems.
>
> The IndexOptimizer uses a different approach. If I read the code right,
> it takes all terms with an idf below a certain threshold and reduces their
> entries, so the total number of documents a search finds changes. With the
> default configuration only about 10% of the terms stay in the index, so
> the answer to the query "http" gets (much) smaller.
>
> What I still do not know: yes, a smaller index makes the system much
> faster, but at what price does it come? Which numbers make sense?
>
> Regards
>
>     Michael
>
>
>
> yoursoft@freemail.hu wrote:
>
>> Dear Michael,
>>
>> I wrote a tool, OptimizeIndex.java; it is faster, and there is no
>> question about what it does.
>> After you optimize the index with IndexOptimizer, is the number of
>> results when searching for 'http' the same?
>>
>> Regards,
>>    Ferenc
>>
>> Michael Nebel wrote:
>>
>>> Hi,
>>>
>>> I fixed the problem with the following patch:
>>>
>>> --- IndexOptimizer.java 2005-08-04 12:55:54.000000000 +0200
>>> +++ IndexOptimizer.java.~1.6.~  2005-01-21 00:48:50.000000000 +0100
>>> @@ -138,7 +138,7 @@
>>>
>>>          if (score > minScore) {
>>>            sdq.put(new ScoreDoc(doc, score));
>>> -          if (sdq.size() >= count) {               // if sdq overfull
>>> +          if (sdq.size() > count) {               // if sdq overfull
>>>              sdq.pop();                            // remove lowest in sdq
>>>              minScore = ((ScoreDoc)sdq.top()).score; // reset minScore
>>>            }
>>>
>>> My index shrank from 8.5 GB to 0.5 GB. I found no documentation
>>> about the background of this tool. Can anyone tell me what the
>>> idea behind it is?
>>>
>>> Regards
>>>
>>>     Michael
>>>
>>>
>>>
>>> Andy Liu wrote:
>>>
>>>> I believe this tool is unfinished and unsupported.
>>>>
>>>> On 7/22/05, yoursoft@freemail.hu <yo...@freemail.hu> wrote:
>>>>
>>>>> I found an IndexOptimizer in Nutch.
>>>>> When I run it, it throws an exception:
>>>>> ....
>>>>> Optimizing url:http from 226957 to 22696
>>>>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
>>>>>        at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
>>>>>        at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
>>>>>        at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
>>>>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
>>>>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
>>>>>        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
>>>>>        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
>>>>>        at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
>>>>>        at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
>>>>>        at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)
>>>>>
>>>
>>>
>
>


Re: IndexOptimizer bug?

Posted by Michael Nebel <mi...@nebel.de>.
Hi Ferenc,

as far as I understand, your tool removes all deleted pages ("nutch
prune", "nutch dedup") from an index and builds a new (smaller) one. In
our workflow we run "nutch prune" on the segment indexes and then do a
"nutch merge", so the deleted pages do not occur in our main index.
In our scenario, your tool only helps us to tune the segment indexes;
with our main index it seems to be of nearly no use... But when changing
the workflow - first merging, then deleting - OptimizeIndex should be a
"must do". We have only been lucky to avoid the problems.

The IndexOptimizer uses a different approach. If I read the code right,
it takes all terms with an idf below a certain threshold and reduces their
entries, so the total number of documents a search finds changes. With the
default configuration only about 10% of the terms stay in the index, so
the answer to the query "http" gets (much) smaller.
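
A rough sketch of that reduction as I read it (plain java.util code for illustration,
not the actual Nutch/Lucene implementation): for a frequent term, keep only the
best-scoring fraction of its postings and drop the rest, which would match the
"Optimizing url:http from 226957 to 22696" line in Ferenc's log. Rare (high-idf)
terms would be left untouched. This is the same bounded put/pop pattern as the
sdq loop in my patch.

// Illustrative only: truncate a frequent term's posting list to its top-scoring ~10%.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class PostingTruncationSketch {

    static class Posting {
        final int doc;
        final float score;
        Posting(int doc, float score) { this.doc = doc; this.score = score; }
    }

    static final float FRACTION = 0.1f;   // assumed default fraction

    static List<Posting> truncate(List<Posting> postings) {
        int count = Math.max(1, (int) (postings.size() * FRACTION));   // e.g. 226957 -> roughly 22696
        // bounded min-heap on score, like the sdq in IndexOptimizer
        PriorityQueue<Posting> sdq = new PriorityQueue<>(Comparator.comparingDouble((Posting p) -> p.score));
        for (Posting p : postings) {
            sdq.add(p);
            if (sdq.size() > count) {
                sdq.poll();                                 // drop the lowest-scoring posting
            }
        }
        List<Posting> kept = new ArrayList<>(sdq);
        kept.sort(Comparator.comparingInt((Posting p) -> p.doc));      // postings stay in document order
        return kept;
    }

    public static void main(String[] args) {
        List<Posting> postings = new ArrayList<>();
        for (int doc = 0; doc < 50; doc++) postings.add(new Posting(doc, (float) Math.random()));
        System.out.println("kept " + truncate(postings).size() + " of " + postings.size() + " postings");
    }
}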

What I still do not know: yes, a smaller index makes the system much
faster, but at what price does it come? Which numbers make sense?

Regards

	Michael



yoursoft@freemail.hu wrote:

> Dear Michael,
> 
> I wrote a tool, OptimizeIndex.java; it is faster, and there is no
> question about what it does.
> After you optimize the index with IndexOptimizer, is the number of
> results when searching for 'http' the same?
> 
> Regards,
>    Ferenc
> 
> Michael Nebel wrote:
> 
>> Hi,
>>
>> I fixed the problem with the following patch:
>>
>> --- IndexOptimizer.java 2005-08-04 12:55:54.000000000 +0200
>> +++ IndexOptimizer.java.~1.6.~  2005-01-21 00:48:50.000000000 +0100
>> @@ -138,7 +138,7 @@
>>
>>          if (score > minScore) {
>>            sdq.put(new ScoreDoc(doc, score));
>> -          if (sdq.size() >= count) {               // if sdq overfull
>> +          if (sdq.size() > count) {               // if sdq overfull
>>              sdq.pop();                            // remove lowest in sdq
>>              minScore = ((ScoreDoc)sdq.top()).score; // reset minScore
>>            }
>>
>> My index shrank from 8.5 GB to 0.5 GB. I found no documentation
>> about the background of this tool. Can anyone tell me what the idea
>> behind it is?
>>
>> Regards
>>
>>     Michael
>>
>>
>>
>> Andy Liu wrote:
>>
>>> I believe this tool is unfinished and unsupported.
>>>
>>> On 7/22/05, yoursoft@freemail.hu <yo...@freemail.hu> wrote:
>>>
>>>> I found an IndexOptimizer in Nutch.
>>>> When I run it, it throws an exception:
>>>> ....
>>>> Optimizing url:http from 226957 to 22696
>>>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
>>>>        at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
>>>>        at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
>>>>        at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
>>>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
>>>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
>>>>        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
>>>>        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
>>>>        at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
>>>>        at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
>>>>        at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)
>>>>
>>
>>


-- 
Michael Nebel                   Augustenburger Str. 1, 22769 Hamburg
                                 Telefon:   040 / 851 581 45
http://www.nebel.de/            Mobil:     0172 / 41 53 256
http://www.netluchs.de/         E-Mail:    michael@nebel.de


Re: IndexOptimizer bug?

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Michael,

I wrote a tool, OptimizeIndex.java; it is faster, and there is no
question about what it does.
After you optimize the index with IndexOptimizer, is the number of
results when searching for 'http' the same?

Regards,
    Ferenc

Michael Nebel wrote:

> Hi,
>
> I fixed the problem with the following patch:
>
> --- IndexOptimizer.java 2005-08-04 12:55:54.000000000 +0200
> +++ IndexOptimizer.java.~1.6.~  2005-01-21 00:48:50.000000000 +0100
> @@ -138,7 +138,7 @@
>
>          if (score > minScore) {
>            sdq.put(new ScoreDoc(doc, score));
> -          if (sdq.size() >= count) {               // if sdq overfull
> +          if (sdq.size() > count) {               // if sdq overfull
>              sdq.pop();                            // remove lowest in sdq
>              minScore = ((ScoreDoc)sdq.top()).score; // reset minScore
>            }
>
> My index shrank from 8.5 GB to 0.5 GB. I found no documentation
> about the background of this tool. Can anyone tell me what the idea
> behind it is?
>
> Regards
>
>     Michael
>
>
>
> Andy Liu wrote:
>
>> I believe this tool is unfinished and unsupported.
>>
>> On 7/22/05, yoursoft@freemail.hu <yo...@freemail.hu> wrote:
>>
>>> I found an IndexOptimizer in Nutch.
>>> When I run it, it throws an exception:
>>> ....
>>> Optimizing url:http from 226957 to 22696
>>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
>>>        at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
>>>        at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
>>>        at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
>>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
>>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
>>>        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
>>>        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
>>>        at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
>>>        at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
>>>        at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)
>>>
>
>


Re: IndexOptimizer bug?

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
I applied your patch, but the result is the same error message...

Michael Nebel wrote:

> Hi,
>
> I fixed the problem with the following patch:
>
> --- IndexOptimizer.java 2005-08-04 12:55:54.000000000 +0200
> +++ IndexOptimizer.java.~1.6.~  2005-01-21 00:48:50.000000000 +0100
> @@ -138,7 +138,7 @@
>
>          if (score > minScore) {
>            sdq.put(new ScoreDoc(doc, score));
> -          if (sdq.size() >= count) {               // if sdq overfull
> +          if (sdq.size() > count) {               // if sdq overfull
>              sdq.pop();                            // remove lowest in sdq
>              minScore = ((ScoreDoc)sdq.top()).score; // reset minScore
>            }
>
> My index shrank from 8.5 GB to 0.5 GB. I found no documentation
> about the background of this tool. Can anyone tell me what the idea
> behind it is?
>
> Regards
>
>     Michael
>
>
>
> Andy Liu wrote:
>
>> I believe this tool is unfinished and unsupported.
>>
>> On 7/22/05, yoursoft@freemail.hu <yo...@freemail.hu> wrote:
>>
>>> I found an IndexOptimizer in Nutch.
>>> When I run it, it throws an exception:
>>> ....
>>> Optimizing url:http from 226957 to 22696
>>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
>>>        at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
>>>        at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
>>>        at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
>>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
>>>        at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
>>>        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
>>>        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
>>>        at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
>>>        at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
>>>        at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)
>>>
>
>