Posted to java-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2008/03/09 02:00:23 UTC

Re: Date sorting problem [ IndexSearcher | Hits | Sort | Float ]

I'm pretty sure your problem is that you're sorting as a Float. The three
values you use are all parsed (according to the Sort doc) by
Float.valueOf, which yields 1.20500099E12 for all three, so they compare as
equal.
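
The loss of precision can be reproduced outside Lucene. The following is a
minimal sketch in plain Java, not the code Lucene runs internally; the class
and method names are made up for illustration. A float carries a 24-bit
significand, so around 1.2E12 neighbouring float values are 131072 apart and
timestamps that differ by a millisecond parse to the same float.

```java
import java.util.HashSet;
import java.util.Set;

public class FloatSortPrecision {

    /** Counts how many distinct float values the given decimal strings parse to. */
    static int distinctFloatValues(String[] terms) {
        Set<Float> seen = new HashSet<>();
        for (String term : terms) {
            seen.add(Float.valueOf(term));
        }
        return seen.size();
    }

    /** Counts how many distinct long values the same strings parse to. */
    static int distinctLongValues(String[] terms) {
        Set<Long> seen = new HashSet<>();
        for (String term : terms) {
            seen.add(Long.valueOf(term));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // The three DATE values from the test case in the original question.
        String[] dates = {"1205000950000", "1205000950001", "1205000950002"};
        for (String date : dates) {
            System.out.println(date + " -> " + Float.valueOf(date));
        }
        System.out.println("distinct as float: " + distinctFloatValues(dates));
        System.out.println("distinct as long:  " + distinctLongValues(dates));
    }
}
```

Once the values are indistinguishable as floats, the sort has nothing to order
them by, which would explain the inconsistent results.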

Why are you using Float as your sortField? If your DATE fields are
normalized (i.e. all the same number of digits), string sorting would work
just fine.
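
As a rough illustration of what "normalized" buys you, here is a sketch in
plain Java that zero-pads each value to a fixed width so that lexicographic
order agrees with numeric order. The class name, the pad() helper and the
19-digit width are assumptions made for this example, not something taken from
Lucene. The three values in the question are already the same length, so they
would sort correctly as strings even without padding.

```java
import java.util.Arrays;
import java.util.Locale;

public class PaddedDateTerms {

    /**
     * Left-pads a non-negative long with zeros to 19 digits, the length of
     * Long.MAX_VALUE, so that every term has the same width.
     */
    static String pad(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        return String.format(Locale.ROOT, "%019d", millis);
    }

    public static void main(String[] args) {
        // Unpadded, "999" sorts after "1205000950000" because '9' > '1'.
        System.out.println("999".compareTo("1205000950000") > 0);

        String[] terms = {
            pad(1205000950002L), pad(999L), pad(1205000950000L), pad(1205000950001L)
        };
        Arrays.sort(terms); // plain string sort, now in numeric order
        for (String term : terms) {
            System.out.println(term);
        }
    }
}
```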

Best
Erick

P.S. Sorry if this comes through multiple times, but my connection is being
wonky

On Sat, Mar 8, 2008 at 1:57 PM, legrand thomas <th...@yahoo.fr>
wrote:

> Dear all,
>
> I'm trying to sort query results using a date criterion. My dates are
> stored as "long" in the database (I cannot change this) and indexed as
> untokenized. The sorted results I get aren't consistent. This problem does
> not occur if the numbers are "smaller".
>
> Am I doing something wrong? Is it possible to sort using the "long" type?
> What else should I do?
>
> Regards & thanks in advance,
> Tom
>
>
> public void testDateSort(){
>     System.out.println("[testDateSort][begin]");
>
>     IndexWriter mAdIndexWriter = null;
>     Directory adIndexDir = null;
>
>     Document doc0 = new Document();
>     doc0.add(new Field("ID", "doc0", Field.Store.YES, Field.Index.TOKENIZED));
>     doc0.add(new Field("DATE", "1205000950000", Field.Store.YES, Field.Index.UN_TOKENIZED));
>
>     Document doc1 = new Document();
>     doc1.add(new Field("ID", "doc1", Field.Store.YES, Field.Index.TOKENIZED));
>     doc1.add(new Field("DATE", "1205000950001", Field.Store.YES, Field.Index.UN_TOKENIZED));
>
>     Document doc2 = new Document();
>     doc2.add(new Field("ID", "doc2", Field.Store.YES, Field.Index.TOKENIZED));
>     doc2.add(new Field("DATE", "1205000950002", Field.Store.YES, Field.Index.UN_TOKENIZED));
>
>     Analyzer analyser = new SimpleAnalyzer();
>     try {
>         adIndexDir = FSDirectory.getDirectory("C:\\YourFavoriteDirectory");
>         mAdIndexWriter = new IndexWriter(adIndexDir, analyser, true);
>
>         mAdIndexWriter.addDocument(doc0);
>         mAdIndexWriter.addDocument(doc2);
>         mAdIndexWriter.addDocument(doc1);
>
>         mAdIndexWriter.optimize();
>         mAdIndexWriter.close();
>
>         IndexReader mAdIndexReader = IndexReader.open(adIndexDir);
>         IndexSearcher searcher = new IndexSearcher(mAdIndexReader);
>
>         Query query = new FuzzyQuery(new Term("ID", "doc"), Float.parseFloat("0.8"));
>         Sort timeSorter = new Sort(new SortField("DATE", SortField.FLOAT, false));
>         Hits allTheHits = searcher.search(query, timeSorter);
>
>         for (int i = 0; i < allTheHits.length(); i++) {
>             System.out.println("Date n°" + i + " = " + allTheHits.doc(i).get("DATE"));
>         }
>
>         mAdIndexReader.close();
>     } catch (Exception ex) {
>         ex.printStackTrace();
>         fail();
>     }
>     System.out.println("[testDateSort][end]");
> }

Re: Scoring a query with OR's

Posted by Chris Hostetter <ho...@fucit.org>.
: I emailed a question earlier about the difference between OR and AND in a
: Boolean query. For what I am trying to do, I need AND to behave like an OR
: (what I like to call a "soft AND"), and I need OR to behave like a logical OR,
: meaning that I don't want to reward documents that contain more of the OR
: operands. It is easy for me to fix the AND, but is there a straightforward way
: of fixing the OR?

Assuming I understand you correctly, I think the DisjunctionMaxQuery will
do what you want in your "OR" case if you set the tiebreaker value to 0.0f
... then the score of the final query will consist solely of the score of
the highest scoring clause.

You may also need to change the coord function of your Similarity ... off
the top of my head I'm not certain.
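
To make the arithmetic behind the DisjunctionMaxQuery suggestion concrete, here
is a small plain-Java model. It is only a sketch: the class, the method names
and the per-clause scores are invented for illustration, and real Lucene scores
include further factors (query normalization, coord and so on). It assumes the
documented combination rule, where the result is the best clause score plus the
tiebreaker times the sum of the remaining clause scores.

```java
public class DisjunctionScoring {

    /** Additive combination: every extra matching OR operand raises the score. */
    static float sumScore(float[] clauseScores) {
        float sum = 0f;
        for (float score : clauseScores) {
            sum += score;
        }
        return sum;
    }

    /** Best clause score plus tieBreaker times the sum of the other clause scores. */
    static float disMaxScore(float[] clauseScores, float tieBreaker) {
        float max = 0f;
        float sum = 0f;
        for (float score : clauseScores) {
            sum += score;
            if (score > max) {
                max = score;
            }
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        float[] matchesAllThree = {0.5f, 0.25f, 0.125f}; // hypothetical clause scores
        float[] matchesBestOnly = {0.5f};

        System.out.println("sum, all three:       " + sumScore(matchesAllThree));
        System.out.println("sum, best only:       " + sumScore(matchesBestOnly));
        System.out.println("dismax 0, all three:  " + disMaxScore(matchesAllThree, 0.0f));
        System.out.println("dismax 0, best only:  " + disMaxScore(matchesBestOnly, 0.0f));
    }
}
```

With a tiebreaker of 0 the two documents come out equal in this model, which is
the "logical OR" behaviour asked for: matching more operands earns nothing
extra.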




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Scoring a query with OR's

Posted by Ghinwa Choueiter <gh...@csail.mit.edu>.
Hi,

I emailed a question earlier about the difference between OR and AND in a
Boolean query. For what I am trying to do, I need AND to behave like an
OR (what I like to call a "soft AND"), and I need OR to behave like a
logical OR, meaning that I don't want to reward documents that contain more
of the OR operands. It is easy for me to fix the AND, but is there a
straightforward way of fixing the OR?

Many thanks!
-Ghinwa

On Sun, 9 Mar 2008, Mark Miller wrote:

> I have been trying to understand all of this better myself, so while I am no 
> expert, here is my take:
>
> Lucene is really a combined Vector Space / Boolean Model search engine.
>
> At its core, Lucene is essentially a Vector Space Model search engine: 
> scoring is done by comparing a query term vector to each of the document term 
> vectors. However, on top of this, Lucene allows a Boolean Model by 
> constraining results using a BooleanQuery.
>
> So when Lucene finds the score for "mark OR mandy", the idea is the same as 
> for "mark AND mandy". The difference is that the BooleanQuery will treat the 
> Must and Should clauses differently: if a term is labeled Must but is not in
> the document, the document won't match. If a Should term is not in the 
> document, the BooleanQuery excludes no extra documents on that account, but 
> the term may contribute 0 towards the similarity score. The BooleanQuery kind 
> of clamps down on top of the Vector Space TermVector similarity scoring, 
> allowing for a hybrid system.
>
> The coord factor essentially juices the term vector similarity score based on 
> how many query terms are in the document. Term overlap is already taken into 
> account during the term vector similarity part, but apparently users don't
> like how that ranks: e.g., users intuitively think that sharing more terms
> between document and query is more important than sharing fewer very highly 
> weighted terms. So basically, coord is just trying to reorder things a bit 
> based on reported user expectations.
>
> - Mark


Re: Scoring a query with OR's

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
With AND, _all_ clauses are required, not just most.   With OR, the  
idea is to reward documents that match more clauses.

	Erik


On Mar 9, 2008, at 1:38 PM, Ghinwa Choueiter wrote:
> but shouldn't the coord factor kick in with AND instead of OR? I
> understand why you would want to use coord in the case of AND,
> where you give a higher reward to documents that contain most of the
> terms in the query. However, in the case of OR, shouldn't it be
> irrelevant whether all the OR operands are in the document?
>
> -Ghinwa


Re: Scoring a query with OR's

Posted by Mark Miller <ma...@gmail.com>.
I have been trying to understand all of this better myself, so while I 
am no expert, here is my take:

Lucene is really a combined Vector Space / Boolean Model search engine.

At its core, Lucene is essentially a Vector Space Model search engine: 
scoring is done by comparing a query term vector to each of the document 
term vectors. However, on top of this, Lucene allows a Boolean Model by 
constraining results using a BooleanQuery.

So when Lucene finds the score for "mark OR mandy", the idea is the same 
as for "mark AND mandy". The difference is that the BooleanQuery will 
treat the Must and Should clauses differently: if a term is labeled Must
but is not in the document, the document won't match. If a Should term 
is not in the document, the BooleanQuery excludes no extra documents on 
that account, but the term may contribute 0 towards the similarity 
score. The BooleanQuery kind of clamps down on top of the Vector Space 
TermVector similarity scoring, allowing for a hybrid system.
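
The Must/Should distinction can be modelled in a few lines of plain Java. This
is a sketch of the matching rule only, with invented class and method names; it
does not use or reproduce Lucene's BooleanQuery code. It assumes that a query
made up solely of Should terms still needs at least one of them present to
match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BooleanClauseModel {

    /**
     * Every MUST term has to be present. SHOULD terms never exclude a
     * document, except that a query with no MUST terms needs at least one
     * SHOULD term to be present.
     */
    static boolean matches(Set<String> docTerms, String[] must, String[] should) {
        for (String term : must) {
            if (!docTerms.contains(term)) {
                return false; // a missing MUST term rules the document out
            }
        }
        if (must.length > 0) {
            return true;
        }
        for (String term : should) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    /** Counts the query terms present in the document; absent terms add nothing. */
    static int matchedTerms(Set<String> docTerms, String[] queryTerms) {
        int count = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("mark", "writes", "code"));
        String[] terms = {"mark", "mandy"};
        String[] none = {};

        System.out.println("mark AND mandy matches: " + matches(doc, terms, none));
        System.out.println("mark OR mandy matches:  " + matches(doc, none, terms));
        System.out.println("query terms present:    " + matchedTerms(doc, terms));
    }
}
```

In this model the document containing only "mark" is rejected by the AND query
and accepted by the OR query, with "mandy" simply contributing nothing.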

The coord factor essentially juices the term vector similarity score 
based on how many query terms are in the document. Term overlap is 
already taken into account during the term vector similarity part, but 
apparently users don't like how that ranks: e.g., users intuitively think
that sharing more terms between document and query is more important 
than sharing fewer very highly weighted terms. So basically, coord is 
just trying to reorder things a bit based on reported user expectations.
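
A toy calculation shows the kind of reordering described above. This is a
sketch under stated assumptions: coord is taken to be overlap / maxOverlap, as
in the DefaultSimilarity javadoc linked elsewhere in this thread, and the
per-term scores are made-up numbers rather than anything Lucene computed. The
class and method names are hypothetical.

```java
public class CoordReordering {

    /** Fraction of the query terms that the document contains. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Sum of the matched terms' scores, optionally scaled by coord. */
    static float score(float[] matchedTermScores, int queryTermCount, boolean useCoord) {
        float sum = 0f;
        for (float termScore : matchedTermScores) {
            sum += termScore;
        }
        if (!useCoord) {
            return sum;
        }
        return sum * coord(matchedTermScores.length, queryTermCount);
    }

    public static void main(String[] args) {
        int queryTerms = 3; // e.g. life OR place OR time
        float[] docA = {0.9f};                // one highly weighted term
        float[] docB = {0.25f, 0.25f, 0.25f}; // all three terms, each weakly weighted

        System.out.println("without coord: A=" + score(docA, queryTerms, false)
                + " B=" + score(docB, queryTerms, false));
        System.out.println("with coord:    A=" + score(docA, queryTerms, true)
                + " B=" + score(docB, queryTerms, true));
    }
}
```

Without coord, document A ranks first on the strength of a single term; with
coord, document B overtakes it because it contains all three query terms.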

- Mark



Ghinwa Choueiter wrote:
> but shouldn't the coord factor kick in with AND instead of OR? I
> understand why you would want to use coord in the case of AND, where
> you give a higher reward to documents that contain most of the terms in
> the query. However, in the case of OR, shouldn't it be irrelevant
> whether all the OR operands are in the document?
>
> -Ghinwa


Re: Scoring a query with OR's

Posted by Ghinwa Choueiter <gh...@csail.mit.edu>.
but shouldn't the coord factor kick in with AND instead of OR? I understand
why you would want to use coord in the case of AND, where you give a higher
reward to documents that contain most of the terms in the query. However, in
the case of OR, shouldn't it be irrelevant whether all the OR operands are in
the document?

-Ghinwa

----- Original Message ----- 
From: "Erik Hatcher" <er...@ehatchersolutions.com>
To: <ja...@lucene.apache.org>
Sent: Sunday, March 09, 2008 1:22 PM
Subject: Re: Scoring a query with OR's


>
> On Mar 9, 2008, at 12:39 PM, Ghinwa Choueiter wrote:
>> but what exactly happens when there are ORs, e.g. (life OR place OR
>> time)?
>>
>> The scoring equation can get a score for life, place, and time separately,
>> but what does it do with them then? Does it also add them?
>
> The coord factor kicks in then:
>
> <http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/DefaultSimilarity.html#coord(int,%20int)>
>
> the formula listed here should help too:
>
> <http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html>
>
> Erik


Re: Scoring a query with OR's

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 9, 2008, at 12:39 PM, Ghinwa Choueiter wrote:
> but what exactly happens when there are ORs, e.g. (life OR
> place OR time)?
>
> The scoring equation can get a score for life, place, and time
> separately, but what does it do with them then? Does it also add them?

The coord factor kicks in then:

<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/DefaultSimilarity.html#coord(int,%20int)>

the formula listed here should help too:

<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html>

	Erik


Scoring a query with OR's

Posted by Ghinwa Choueiter <gh...@csail.mit.edu>.
Hi,

I had a look at the scoring equation and read the online scoring document:
http://lucene.apache.org/java/docs/scoring.html#Scoring

It is clear to me how the scoring equation would work for a query that
contains AND:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html

but what exactly happens when there are ORs, e.g. (life OR place OR
time)?

The scoring equation can get a score for life, place, and time separately, but
what does it do with them then? Does it also add them?

Many thanks,
-Ghinwa 

