Posted to java-user@lucene.apache.org by Leo Galambos <Le...@seznam.cz> on 2003/05/17 03:04:19 UTC

Re: Lucene not used on xml.apache.org -> pagerank

If a tool X is used for xml.apache.org (it is a web site, isn't it?), 
then tool X must be evaluated as a web tool. That it is also a generic 
tool is not relevant here, I think. ;-)

You talk about a "generic tool", so I started reading the Lucene API. My 
guess is that you cannot implement dozens of features efficiently with 
it, for instance the pagerank we are discussing now.

Option 1: You could use setBoost, but then the issue is modifying 
these values: pagerank is not a constant.
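
To make the Option 1 problem concrete, here is a minimal sketch in plain 
Java. It does not use the Lucene API: BoostedIndex, boostFor and the 
0.2-2.0 boost range are hypothetical names and values chosen only for 
illustration. It assumes the boost is stored with the document when the 
document is added, so a changed pagerank means removing and re-adding it.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of index-time boosts: the boost is frozen when a document is added. */
class BoostedIndex {
    // url -> boost stored at add time (hypothetical stand-in for a real index)
    private final Map<String, Float> storedBoost = new HashMap<>();
    private int rewrites = 0;

    /** Maps a pagerank in [0,1] to a boost in [0.2, 2.0]; the range is an assumption. */
    static float boostFor(double pageRank) {
        double clamped = Math.max(0.0, Math.min(1.0, pageRank));
        return (float) (0.2 + 1.8 * clamped);
    }

    /** Adds a document; its boost is computed once, from the pagerank known right now. */
    void add(String url, double pageRank) {
        storedBoost.put(url, boostFor(pageRank));
        rewrites++;
    }

    /** A new pagerank cannot be patched in place here: delete, then add again. */
    void updateRank(String url, double newPageRank) {
        storedBoost.remove(url);
        add(url, newPageRank);
    }

    /** Final score = raw text score multiplied by the stored boost. */
    float score(String url, float rawScore) {
        return rawScore * storedBoost.get(url);
    }

    /** Number of document writes so far; grows with every pagerank change. */
    int rewrites() {
        return rewrites;
    }
}
```

Every pagerank recalculation touches every document whose rank moved, so 
the write count in this model grows with each recalculation, not just 
with the number of new pages.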

Option 2: You could modify the Similarity object and load the pagerank 
values from a separate DB. The issue is the handle you need to look up 
the value. The UID is volatile, so you must use other metadata values, 
which costs time. If you work it out with the UID directly (I do not see 
how), you have another issue in the Merger: it is hard-coded, so you 
lose the link between the UID and the other DB there.
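
A rough sketch of the Option 2 lookup problem, again in plain Java and 
not against the Lucene API. ExternalRank, combine and merge are 
hypothetical names, the formula score * (1 + rank) is only an assumed way 
to combine the two values, and the list index stands in for the internal 
document number.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ExternalRank {
    /**
     * Re-scores a hit with a rank loaded from a separate table. The table is
     * keyed by a stable field (the URL), not by the internal document number.
     */
    static double combine(double textScore, String url, Map<String, Double> rankByUrl) {
        double rank = rankByUrl.getOrDefault(url, 0.0);
        return textScore * (1.0 + rank);
    }

    /**
     * Simulates a merge that drops one deleted document. The survivors after
     * it shift down by one, so any table keyed by the old numbers is stale.
     */
    static List<String> merge(List<String> urlsByDocId, int deletedDocId) {
        List<String> merged = new ArrayList<>(urlsByDocId);
        merged.remove(deletedDocId); // int argument: removes by position
        return merged;
    }
}
```

The URL key survives the renumbering, but each hit then needs an extra 
fetch of that field before the rank can be looked up, which is the time 
cost mentioned above.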

Option 3: You could quantize the pagerank to values 1-10 and build a 
separate index for each value. The issue is then related to the O/S. 
a) If you compute the hit lists concurrently, Lucene needs a lot of file 
handles, which may open the door to DoS. b) If you calculate the 
similarity sequentially, you lose time during the .tii/.tis lookups (you 
have to process 10x more barrels).
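
The arithmetic behind Option 3 can be sketched as follows. RankBuckets, 
bucketOf and dictionaryLookups are hypothetical names; the sketch assumes 
a pagerank normalized to [0,1] and one term-dictionary lookup per query 
term per index, which is a simplification.

```java
class RankBuckets {
    static final int BUCKETS = 10;

    /** Maps a pagerank in [0,1] to a bucket number 1..10 (one index per bucket). */
    static int bucketOf(double pageRank) {
        double clamped = Math.max(0.0, Math.min(1.0, pageRank));
        return Math.min(BUCKETS, 1 + (int) (clamped * BUCKETS));
    }

    /** Term-dictionary lookups for one query: one per term per index searched. */
    static int dictionaryLookups(int queryTerms, int indexes) {
        return queryTerms * indexes;
    }
}
```

Under these assumptions a three-term query costs 3 lookups against one 
index and 30 against ten, and a document whose pagerank crosses a bucket 
boundary has to move from one index to another.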

I may be missing something, that's true, but I do not think that the 
label "generic" automatically means "universal". That is the point.

-g-

Otis Gospodnetic wrote:

>I think Leo keeps thinking about Lucene as a text indexing tool
>for the web, when in fact it is a generic indexing tool.
>
>Otis
>
>--- Eric Isakson <Er...@sas.com> wrote:
>
>>What is wrong with Document.setBoost(float)?
>>/** Sets a boost factor for hits on any field of this document.  This value
>> * will be multiplied into the score of all hits on this document.
>> */
>>
>>If you know you are indexing a certain kind of page and your
>>application using Lucene wants to lower the importance of those
>>pages, set this to some low number like .2 when you are adding them
>>to the index.
>>
>>Eric
>>
>>-----Original Message-----
>>From: Leo Galambos [mailto:Leo.G@seznam.cz] 
>>Sent: Friday, May 16, 2003 11:54 AM
>>To: Lucene Users List
>>Subject: Re: Lucene not used on xml.apache.org
>>
>>
>>>But they should use Lucene, you are right :)
>>>They don't know what they are missing...
>>
>>Do you mean the hit lists, which often list low rank pages (i.e. JavaDoc
>>APIs) before solid project pages? ;-)
>>
>>Unfortunately, Lucene does not implement any mechanism that would be 
>>able to eliminate this behaviour IMHO.
>>
>>-g-
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org



