Posted to java-user@lucene.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2004/10/15 17:01:32 UTC

Re: your mail

Christoph Mangold wrote:

> Hi Andrzej,
> 
> I found this Email of yours on the lucene newsgroup.

Hi Christoph,

You need to remember that what I described is more a hack than a proper 
solution. However, it worked for me well enough, without the need to 
modify the core of Lucene...

>>There are other tricks you can use here, too... In one of my projects I
>>had a need to store a list of weighted keywords. No problem storing
>>multiple tokens under the same field name, as Erik explained above.
>>However, in Lucene you can only apply a single boost value to a field. I
>>ended up encoding the keywords like "10.0 keyword" and then writing an
>>analyzer which skips the initial numbers when processing this particular
>>field (which was stored, indexed and tokenized).
> 
> 
> I have exactly the problem you describe. Could you provide some details on
> this? In particular the following points are not clear to me:
> 
> - Is 10.0 supposed to be the boost of the keyword?

Yes. However, this boost cannot really be used during the scoring, only 
during the query formulation - because these boost values are not 
indexed, only stored as text. In other words: the value "10.0" is 
ignored during searching, but since in my scenario I would formulate the 
query based on the keywords from the current document, I could use the 
values from the current document to boost terms in the query.
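A minimal sketch of that query-formulation step, in plain Java with no
Lucene dependency. It assumes the stored values look like "10.0 house"
and that the query is later parsed with the standard query parser
syntax (field:term^boost). The class and method names are hypothetical,
and query-syntax special characters in the keywords are not escaped
here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns stored "weight keyword" values into a
// boosted query string. Not part of Lucene.
final class WeightedKeywordQuery {

    /** Converts one stored value, e.g. "10.0 house", into "field:house^10.0". */
    static String toBoostedTerm(String field, String stored) {
        String trimmed = stored.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("expected \"<weight> <keyword>\": " + stored);
        }
        // The leading number is the weight; the rest is the keyword (or phrase).
        float weight = Float.parseFloat(trimmed.substring(0, space));
        String keyword = trimmed.substring(space + 1).trim();
        // Multi-word keywords are quoted so they are parsed as a phrase.
        String term = keyword.indexOf(' ') >= 0 ? "\"" + keyword + "\"" : keyword;
        return field + ":" + term + "^" + weight;
    }

    /** Joins the boosted terms for all stored values of the current document. */
    static String buildQuery(String field, List<String> storedValues) {
        List<String> parts = new ArrayList<>();
        for (String value : storedValues) {
            parts.add(toBoostedTerm(field, value));
        }
        return String.join(" ", parts);
    }
}
```

The resulting string would then be handed to the query parser; the
weights influence scoring only through the query, not through the
index.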

> - Do you name the field "10.0 keyword: house" or the keyword itself (like
> "keyword: 10.0 house")?

The latter.
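
To illustrate the analyzer side: the sketch below shows, in plain Java,
the token-skipping logic such an analyzer would need for a value like
"10.0 house". It is only the string handling, not a Lucene Analyzer
subclass, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing which tokens would reach the index
// once the leading weight is skipped. Not part of Lucene.
final class WeightSkippingTokens {

    /** Returns the whitespace-separated tokens, minus a leading numeric weight. */
    static List<String> tokens(String stored) {
        List<String> out = new ArrayList<>();
        String[] parts = stored.trim().split("\\s+");
        for (int i = 0; i < parts.length; i++) {
            // Only the first token may be a weight such as "10.0" or "3".
            if (i == 0 && parts[i].matches("\\d+(\\.\\d+)?")) {
                continue;
            }
            if (!parts[i].isEmpty()) {
                out.add(parts[i]);
            }
        }
        return out;
    }
}
```

Because the field is also stored, the original "10.0 house" text stays
retrievable from the document even though only "house" is indexed.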

> - In either case: If you enhance the analyser to skip the value, I don't
> understand how the Scorer can use the boost information afterwards.

The Scorer cannot - that's the drawback. I can use it, however, when
constructing the query based on the current document's keywords.

> 
> It would be great if you would provide some clues about which classes
> should be enhanced etc.

Another approach you may want to investigate is to artificially add
more identical tokens to the field. This increases the number of index
postings for those terms, which results in a higher score. E.g.

keyword: "hello world SEPARATOR hello world"

would automatically give you a boost (nominally 2.0, although the
actual factor depends on the Similarity in use) for all of "hello",
"world" and "hello world", and would correctly not give a boost for
"world hello" (because of the inserted SEPARATOR token). If you don't
have to store the values of these fields then the index size would
increase only minimally, because each indexed token is stored only
once, and after that only its postings (occurrences) are stored.
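
A minimal sketch of building such a field value, in plain Java. The
class name, the method name and the literal SEPARATOR token are
assumptions for illustration; any token that never occurs in real
keywords would do.

```java
// Hypothetical helper: repeats a keyword phrase so that it produces
// more postings when the field is indexed. Not part of Lucene.
final class RepeatedKeywordField {

    // Token placed between repetitions so that the end of one copy and
    // the start of the next do not form a spurious phrase match.
    static final String SEPARATOR = "SEPARATOR";

    /** Returns the phrase repeated the given number of times, SEPARATOR-delimited. */
    static String repeat(String phrase, int times) {
        if (times < 1) {
            throw new IllegalArgumentException("times must be >= 1");
        }
        StringBuilder sb = new StringBuilder(phrase);
        for (int i = 1; i < times; i++) {
            sb.append(' ').append(SEPARATOR).append(' ').append(phrase);
        }
        return sb.toString();
    }
}
```

Calling repeat("hello world", 2) yields the field value shown in the
example above.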

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

