Posted to java-user@lucene.apache.org by Gonçalo Gaiolas <go...@outsystems.com> on 2006/09/05 12:54:07 UTC

Scoring based on fields and categorization

Hi there,

I need to make two changes to Lucene:

-          Scoring should take into consideration not only the relevance of
the contents, but also two numerical values in other document fields. For
example, let's assume that the normal score for Document A is 0.33 (as
calculated by Lucene). What I need is for its true score to be 0.33 * (value
of field A) * (value of field B). What is the best way to accomplish this?
I've read that changing the scoring algorithm is difficult and painful.

-          I need to make sure only one document per Category is retrieved.
Categories are also implemented as index fields. So, for example, if my
search yields two documents with the same Category (let's assume Movies),
only the higher-scoring document is returned. I'm assuming the easiest way
to implement this is by post-processing the search results, maybe with a
HitCollector?

I would really appreciate any help on these subjects.

Thanks a lot,

Gonçalo Gaiolas

Re: obtaining the number of documents stored in a .cfs file

Posted by Andrzej Bialecki <ab...@getopt.org>.

Stanislav Jordanov wrote:
> Suppose I have a bunch of valid .cfs files while the 
> segments/segments.new file is missing or invalid.
> The task is to 'recover' the present .cfs files into a valid index.
> I think it will be necessary and sufficient to create a segments file 
> that references the .cfs files.
> The only problem I've encountered in generating a valid and 
> well-formed segments file is that I need to know the number of docs in 
> each .cfs file.
> So my two questions are:
> Do I have to put the right number of docs for each segment, or will any 
> (dummy) number do?

Not sure, but I doubt anything other than a valid number would work.

> If I have to put the right number there, how do I get it from the 
> .cfs file?

Look at the size of the _xx.f1 file inside the CFS file; this is a norms
file, and its size in bytes is the same as the number of documents in
that segment.

(You can use the CompoundFileReader.list() and fileLength() methods.)
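
As a rough illustration of the arithmetic only: assuming the index format of this Lucene generation (one norm byte per document in a norms file and, as the recovery code elsewhere in this thread assumes, one 8-byte entry per document in the .fdx file), the document count follows directly from a file length. The class and method names below are hypothetical and the lengths are made-up values; in practice they would come from fileLength().

```java
// Hypothetical helper, not part of Lucene: derives a segment's document
// count from the length of one of its per-document files.
final class SegmentDocCount {

    // Norms file (e.g. _xx.f1): assumed to hold one byte per document.
    static int fromNormsLength(long normsBytes) {
        return Math.toIntExact(normsBytes);
    }

    // Stored-fields index (.fdx): assumed to hold one 8-byte entry per document.
    static int fromFdxLength(long fdxBytes) {
        if (fdxBytes % 8 != 0) {
            throw new IllegalArgumentException(
                    "length is not a multiple of 8: " + fdxBytes);
        }
        return Math.toIntExact(fdxBytes / 8);
    }

    public static void main(String[] args) {
        // Made-up lengths for a segment holding 1500 documents.
        System.out.println(fromNormsLength(1500L));
        System.out.println(fromFdxLength(12000L));
    }
}
```

If the two counts disagree for the same segment, that is a hint the files are damaged or that the format assumption does not hold for the Lucene version in use.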

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: obtaining the number of documents stored in a .cfs file

Posted by Volodymyr Bychkoviak <vb...@i-hypergrid.com>.

One more note:
this code should live in the package 'org.apache.lucene.index' because it
uses some package-visible classes :)

-- 
regards,
Volodymyr Bychkoviak

Re: obtaining the number of documents stored in a .cfs file

Posted by Volodymyr Bychkoviak <vb...@i-hypergrid.com>.

One mistake in this code: the counter assignment should be
    infos.counter = ++counter;
instead of
    infos.counter = counter++;
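
To see why the distinction matters, here is a small standalone sketch in plain Java (no Lucene classes; the class and method names are made up). With the postfix form the assignment receives the old value, so the stored counter would equal the highest existing segment number, and presumably the next segment written could reuse that name.

```java
// Hypothetical demo of postfix versus prefix increment in an assignment.
final class IncrementDemo {

    // Mirrors "infos.counter = counter++": the value assigned is the OLD counter.
    static int postfix(int highestSegment) {
        int counter = highestSegment;
        int stored = counter++;
        return stored;
    }

    // Mirrors "infos.counter = ++counter": the value assigned is the NEW counter.
    static int prefix(int highestSegment) {
        int counter = highestSegment;
        int stored = ++counter;
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(postfix(10)); // 10
        System.out.println(prefix(10));  // 11
    }
}
```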

-- 
regards,
Volodymyr Bychkoviak



Re: obtaining the number of documents stored in a .cfs file

Posted by Volodymyr Bychkoviak <vb...@i-hypergrid.com>.

I've used the following code to recover an index. Note: it only works with
.cfs files.


    String path = // path to index
    File file = new File(path);
    Directory directory = FSDirectory.getDirectory(file, false);

    // Collect every compound segment file in the index directory.
    String[] files = file.list(new FilenameFilter() {

      public boolean accept(File dir, String name) {
        return name.endsWith(".cfs");
      }

    });
   
    SegmentInfos infos = new SegmentInfos();
    int counter = 0;
    for (int i = 0; i < files.length; i++) {
      String fileName = files[i];

      // Strip the leading "_" and the extension, then read the rest as a
      // base-36 segment number; remember the highest one seen.
      String segmentName = fileName.substring(1, fileName.lastIndexOf('.'));
      int segmentInt = Integer.parseInt(segmentName, Character.MAX_RADIX);
      counter = Math.max(counter, segmentInt);

      // Full segment name, including the leading underscore.
      segmentName = fileName.substring(0, fileName.lastIndexOf('.'));

      // The .fdx length divided by 8 (one 8-byte entry per document)
      // gives the number of documents in the segment.
      Directory fileReader = new CompoundFileReader(directory, fileName);
      IndexInput indexStream = fileReader.openInput(segmentName + ".fdx");
      int size = (int) (indexStream.length() / 8);
      indexStream.close();
      fileReader.close();

      SegmentInfo segmentInfo = new SegmentInfo(segmentName, size, directory);
      infos.addElement(segmentInfo);
    }

    // ++counter rather than counter++: the stored counter must be one past
    // the highest existing segment number.
    infos.counter = ++counter;

    infos.write(directory);
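
The name handling in the loop can be tried in isolation. The sketch below uses only the JDK; the class name and the sample file names are invented, and it assumes segment files are named "_" followed by a base-36 number, which is what the parseInt call with Character.MAX_RADIX implies.

```java
// Hypothetical helper, not part of Lucene: parses segment numbers out of
// .cfs file names and computes the counter for the next segment.
final class SegmentNames {

    // "_a.cfs" -> 10: drop the leading '_' and the extension, parse base 36.
    static int segmentNumber(String fileName) {
        String digits = fileName.substring(1, fileName.lastIndexOf('.'));
        return Integer.parseInt(digits, Character.MAX_RADIX);
    }

    // One past the highest segment number found among the given files.
    static int nextCounter(String[] fileNames) {
        int highest = 0;
        for (String fileName : fileNames) {
            highest = Math.max(highest, segmentNumber(fileName));
        }
        return highest + 1;
    }

    public static void main(String[] args) {
        String[] files = {"_1.cfs", "_a.cfs", "_z.cfs", "_10.cfs"};
        for (String f : files) {
            System.out.println(f + " -> " + segmentNumber(f));
        }
        System.out.println("next counter: " + nextCounter(files)); // 37
    }
}
```

Note that "_10" is larger than "_z" here, because the names are numbers in base 36 rather than strings to be compared alphabetically.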

-- 
regards,
Volodymyr Bychkoviak

obtaining the number of documents stored in a .cfs file

Posted by Stanislav Jordanov <st...@sirma.bg>.

Suppose I have a bunch of valid .cfs files while the 
segments/segments.new file is missing or invalid.
The task is to 'recover' the present .cfs files into a valid index.
I think it will be necessary and sufficient to create a segments file 
that references the .cfs files.
The only problem I've encountered in generating a valid and well-formed 
segments file is that I need to know the number of docs in each .cfs file.
So my two questions are:
Do I have to put the right number of docs for each segment, or will any 
(dummy) number do?
If I have to put the right number there, how do I get it from the .cfs 
file?

Stanislav


RE: Scoring based on fields and categorization

Posted by karl wettin <ka...@gmail.com>.

On Tue, 2006-09-05 at 13:32 +0100, Gonçalo Gaiolas wrote:
> should this boosting occur during index time or at query time? I'm a
> bit confused as to where I should apply this boost in order to affect
> the results of a search query.

You boost at index time.



Re: Highlighting "really" found terms

Posted by Karel Tejnora <ka...@tejnora.cz>.

Not for now, but I'd like to contribute span support soon.

Karel
> An alternative highlighter implementation was recently contributed here:
>    http://issues.apache.org/jira/browse/LUCENE-644?page=all
> I've not had the time to study this alternative in detail (I hope to soon) so I can't say if it will do Spans correctly. 
>   



Re: Highlighting "really" found terms

Posted by Shane <lu...@my-family.us>.

Is your objective to avoid highlighting matching tokens which are not in 
a phrase?  I recently received the request to avoid highlighting single 
tokens which appear in the hit (vs. sequences of matched tokens).
I have just completed a partial rewrite of the getBestTextFragments 
method to allow this.  Now the calling object can specify the minimum 
number of tokens (the default is 1, to replicate the current 
functionality) that have to be in a sequence before the tokens will be 
highlighted.

I haven't done a whole lot of testing as I finished the code last night, 
but if you are interested I have made the code available (along with a 
patch file) at http://my-family.us/highlighter.  To set the minimum 
sequence size, just call setMinTokenSequence(int) after creating the 
Highlighter object.
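
The idea of a minimum token sequence can be illustrated without the Highlighter classes. What follows is not the patch mentioned above, only a sketch of the rule as described: a matched token is highlighted only when it sits in a run of at least minSequence consecutive matched tokens. The class name, the single-space tokenization and the <B> markup are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, independent of Lucene's highlighter package.
final class SequenceHighlighter {

    // queryTerms are expected in lower case; text is split on single spaces.
    static String highlight(String text, Set<String> queryTerms, int minSequence) {
        String[] tokens = text.split(" ");
        boolean[] mark = new boolean[tokens.length];

        int runStart = 0; // index where the current run of matches began
        for (int i = 0; i <= tokens.length; i++) {
            boolean matched = i < tokens.length
                    && queryTerms.contains(tokens[i].toLowerCase());
            if (!matched) {
                // The run [runStart, i) has ended; keep it only if long enough.
                if (i - runStart >= minSequence) {
                    Arrays.fill(mark, runStart, i, true);
                }
                runStart = i + 1;
            }
        }

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(mark[i] ? "<B>" + tokens[i] + "</B>" : tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("new", "york"));
        String text = "new york is new and york city is big";
        System.out.println(highlight(text, terms, 1)); // every match
        System.out.println(highlight(text, terms, 2)); // only the adjacent pair
    }
}
```

With minSequence set to 1 every matched token is marked, which corresponds to the default behaviour described above; with 2, only the adjacent "new york" pair is marked.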

Shane

Harini Raghavan wrote:
> I have a requirement to highlight phrases. I came across a reference 
> to this alternate highlighter implementation. But I am unable to see 
> the source files for the same. Can someone please point me to it?
>
> Thanks,
> Harini
>


Re: Highlighting "really" found terms

Posted by mark harwood <ma...@yahoo.co.uk>.

See here for a thread reviewing the challenges and possible solutions associated with this problem:
   http://www.mail-archive.com/java-user@lucene.apache.org/msg02543.html

An alternative highlighter implementation was recently contributed here:
   http://issues.apache.org/jira/browse/LUCENE-644?page=all
I've not had the time to study this alternative in detail (I hope to soon) so I can't say if it will do Spans correctly. 

Cheers
Mark




Highlighting "really" found terms

Posted by Pierre Van Ingelandt <pv...@inforama.fr>.

Hello,

After a search, I need to highlight only the terms that "really"
correspond to the query.
For instance:
1/ I search for docs with toto and titi in the SAME sentence (using
SpanNotQuery(SpanNearQuery({"toto","titi"}, 99999), "."))
2/ Then I try to highlight the "toto" and "titi" found (I use the
QueryScorer from the highlight package)

The problem is that it highlights ALL the titi and toto terms in the
documents (even if they are not in the same sentence).
Is there a way to highlight only the terms really found?

Thanks a lot!

Pierre
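
One way to picture the desired behaviour, independent of Lucene's span queries and highlighter, is a pass over the text itself. The sketch below is an illustration under stated assumptions, not a solution offered in this thread: it treats "." as the sentence boundary, expects the terms in lower case, and marks them only in sentences that contain all of them. Every name in it is hypothetical, and the <B> markup is arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch: highlight terms only where they share a sentence.
final class SameSentenceHighlighter {

    // terms are expected in lower case.
    static String highlight(String text, List<String> terms) {
        List<String> quoted = new ArrayList<>();
        for (String term : terms) {
            quoted.add(Pattern.quote(term));
        }
        Pattern anyTerm = Pattern.compile(
                "\\b(" + String.join("|", quoted) + ")\\b",
                Pattern.CASE_INSENSITIVE);

        StringBuilder out = new StringBuilder();
        // Split after each '.', so every sentence keeps its own delimiter.
        for (String sentence : text.split("(?<=\\.)")) {
            Set<String> words = new HashSet<>(Arrays.asList(
                    sentence.toLowerCase().split("[^\\p{L}\\p{N}]+")));
            if (words.containsAll(terms)) {
                sentence = anyTerm.matcher(sentence).replaceAll("<B>$1</B>");
            }
            out.append(sentence);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Toto met titi. Toto left. Titi stayed.";
        System.out.println(highlight(text, Arrays.asList("toto", "titi")));
    }
}
```

Only the first sentence of the sample text contains both terms, so only its two occurrences are wrapped; the later, isolated occurrences are left untouched.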



RE: Scoring based on fields and categorization

Posted by Gonçalo Gaiolas <go...@outsystems.com>.

Hi Karl,

Thanks for the super quick response!

One question - should this boosting occur during index time or at query
time? I'm a bit confused as to where I should apply this boost in order to
affect the results of a search query.

Once again, thanks a lot!

Gonçalo


Re: Scoring based on fields and categorization

Posted by karl wettin <ka...@gmail.com>.

On Tue, 2006-09-05 at 11:54 +0100, Gonçalo Gaiolas wrote:
> -          Scoring should take into consideration not only the relevance of
> the contents, but also two numerical values in other document fields. For
> example, let's assume that the normal score for Document A is 0.33 (as
> calculated by Lucene). What I need is for its true score to be 0.33 * (value
> of field A) * (value of field B). What is the best way to accomplish this?
> I've read that changing the scoring algorithm is difficult and painful.

Indeed you want to stay away from the scoring algorithm if you can. It is
probably much easier for you to just boost the document based on the
values you have:

http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#setBoost(float)
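
As a sketch of the arithmetic only: if the two numeric field values are known when the document is indexed, they can be folded into a single factor that is then handed to setBoost(float), the method linked above. The class below is hypothetical and uses no Lucene classes; the sample values are invented. One caveat worth checking against your Lucene version: index-time boosts are stored in a compressed form together with the field norms, so the effective multiplier may only approximate the product.

```java
// Hypothetical helper showing the intended score arithmetic; not Lucene code.
final class BoostMath {

    // The single factor that would be passed to Document.setBoost(float).
    static float combinedBoost(float fieldA, float fieldB) {
        return fieldA * fieldB;
    }

    // The score the original poster is after: base score times both values.
    static float expectedScore(float baseScore, float fieldA, float fieldB) {
        return baseScore * combinedBoost(fieldA, fieldB);
    }

    public static void main(String[] args) {
        // Invented values: base score 0.33, field A = 2.0, field B = 1.5.
        System.out.println(combinedBoost(2.0f, 1.5f));
        System.out.println(expectedScore(0.33f, 2.0f, 1.5f));
    }
}
```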


> -          I need to make sure only one document per Category is retrieved.
> Categories are also implemented as index fields. So, for example, if my
> search yields two documents with the same Category (let's assume Movies),
> only the higher-scoring document is returned. I'm assuming the easiest way
> to implement this is by post-processing the search results, maybe with a
> HitCollector?

Yes, in most cases that would be the way to go.
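
The post-processing step might look like the following sketch. It is written against plain Java collections rather than a real HitCollector, so the Hit class, its fields and the sample data are all hypothetical; in an actual collector the category would have to be looked up for each document id as hits arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep only the best-scoring hit per category.
final class BestPerCategory {

    static final class Hit {
        final int doc;
        final float score;
        final String category;

        Hit(int doc, float score, String category) {
            this.doc = doc;
            this.score = score;
            this.category = category;
        }
    }

    static List<Hit> collapse(List<Hit> hits) {
        Map<String, Hit> best = new HashMap<>();
        for (Hit hit : hits) {
            Hit current = best.get(hit.category);
            if (current == null || hit.score > current.score) {
                best.put(hit.category, hit);
            }
        }
        // Return the survivors ordered by descending score.
        List<Hit> result = new ArrayList<>(best.values());
        result.sort((a, b) -> Float.compare(b.score, a.score));
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(1, 0.9f, "Movies"),
                new Hit(2, 0.5f, "Movies"),
                new Hit(3, 0.7f, "Books"));
        for (Hit hit : collapse(hits)) {
            System.out.println(hit.doc + " " + hit.category + " " + hit.score);
        }
    }
}
```

Because only one entry per category is held at any time, memory use grows with the number of distinct categories rather than with the number of hits.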



Re: Scoring based on fields and categorization

Posted by Chris Hostetter <ho...@fucit.org>.

: the contents, but also two numerical values in other document fields. For
: example, let's assume that the normal score for Document A is 0.33 (as
: calculated by Lucene). What I need is for its true score to be 0.33 * (value
: of field A) * (value of field B). What is the best way to accomplish this?

see also "FunctionQuery" ...

https://issues.apache.org/jira/browse/LUCENE-446




-Hoss

