You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Hui Fang (JIRA)" <ji...@apache.org> on 2007/07/26 04:03:31 UTC

[jira] Created: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Implement a state-of-the-art retrieval function in Lucene
---------------------------------------------------------

                 Key: LUCENE-965
                 URL: https://issues.apache.org/jira/browse/LUCENE-965
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
    Affects Versions: 2.2
            Reporter: Hui Fang


We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 

Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
The second document should be more relevant than the first one, because it has more 
occurrences of the query term "w3". But the original test case would require us to rank 
the first document higher than the second one, which is not reasonable. 



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515803 ] 

Doug Cutting commented on LUCENE-965:
-------------------------------------

> It does seem like calculating the average field length at index time should be relatively cheap. 

Yes, it should.  But if average norm suffices, that can be computed on demand and cached in the Searcher without significantly impacting performance.  The norms need to be read anyway, and averaging them adds only a small constant factor to the cost of reading them.


> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515846 ] 

Grant Ingersoll commented on LUCENE-965:
----------------------------------------

I guess I would not be in favor of a special term, I would rather see it integrated into the file format somehow.  Special terms get deleted, misused, etc.  Plus the avg. doc length is going to be something that is going to need to be updated frequently, right? 

Since we are talking 3.x of Lucene fairly soon anyway (assuming the JDK 1.5 vote passes), this would allow us to make the file format change as well, as long as we can still read prior versions.

Charlie, as for you question about what users value in Lucene, speed or recall and precision, the answer is both.  :-)  Some people care more about speed while others care about p/r.  I think most people that use Lucene have the feeling that the results are good enough in production environments and that we don't always worry about getting every last bit out of TREC (especially since we can't, as a group, test against TREC).  That being said, I would bet most users would be willing to trade off a few percentage points of speed in exchange for the kind of MAP improvements we are talking here.  Especially since we probably can eventually figure out a way to make it as fast anyway, or at least find other things we can speed up.

Correct me if I am wrong, but there are other IR strategies that can use the avg. doc. length, too, right?  So, not to sidetrack too much, but if we do this right, maybe we can also open up the door to other scoring strategies as well without much downside.  Just something to consider.



> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Hui Fang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745608#action_12745608 ] 

Hui Fang commented on LUCENE-965:
---------------------------------

Hello everyone, 

We have re-implemented the retrieval functions in a very different way. The main differences are (1) the average document length will not be computed in the retrieval process as we did the previous implementation, which could make the retrieval process more efficiently and (2) instead of modifying the existing search related classes, we integrate the new retrieval functions through two new classes, i.e., AXTermQuery and. AXTermScorer by extending TermQuery and TermScorer classes.  I think that the current implementation addresses most concerns raised in this discussion threads. 

The source codes and the updated reports of our implementation is now available at http://www.ece.udel.edu/~hfang/LuceneAX.html.   We have implemented two slightly versions for lucene-2.4.1 and lucene-2.9-dev.   We hope that the implementation of the axiomatic retrieval function could be integrated in the releases of the Lucene.    Please feel free to let me know if you have any questions or comments. 

Thanks,
-Hui 

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>             Fix For: 3.0
>
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780607#action_12780607 ] 

Grant Ingersoll commented on LUCENE-965:
----------------------------------------

Hi Hui,

I see you updated your paper on this, have you looked at how this might be implemented given the flexible indexing work under way?  

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>             Fix For: 3.1
>
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745617#action_12745617 ] 

Jason Rutherglen commented on LUCENE-965:
-----------------------------------------

The link http://www.ece.udel.edu/~hfang/lucene/lucene-2.9-dev-AX-contrib.tar.gz doesn't work?

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>             Fix For: 3.0
>
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Charlie Zhao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515783 ] 

Charlie Zhao commented on LUCENE-965:
-------------------------------------

Hello Hui:

Thank you for contributing your axiomatic retrieval function to Lucene. Can not wait for the test drive.

Would you please report your settings for your experiment on

Collection     Function                 MAP     P5     P10     P20     P100     NumRR
ROBUST04    Lucene Default     0.048     0.12 0.09     0.08     0.05     21

Since there are disparities comparing with mine.

num_q     249
num_ret     239436
num_rel     17412
num_rel_ret     9780
map     0.2076
gm_ap     0.1049
R-prec     0.2551
bpref     0.2189
recip_rank     0.5684
ircl_prn.0.00     0.6288
ircl_prn.0.10     0.4459
ircl_prn.0.20     0.3562
ircl_prn.0.30     0.2864
ircl_prn.0.40     0.2289
ircl_prn.0.50     0.1925
ircl_prn.0.60     0.145
ircl_prn.0.70     0.1062
ircl_prn.0.80     0.0702
ircl_prn.0.90     0.0461
ircl_prn.1.00     0.0261
P5     0.3944
P10     0.3598
P15     0.3307
P20     0.307
P30     0.2657
P100     0.1618
P200     0.1117
P500     0.0635
P1000     0.0393

Before we go further, let us make sure we are in the same page.

Here is my setting:

    Data: TREC Disk 4 & 5; 528,155 documents; 1,904 MB of text.

    Query Number: TREC Query Number 301-700

    Query Field: <title> only

    IR Engine: Lucene 2.0 (need double check, but close:)

    Note: default Lucene similarity function, using title words only.

If we are in the same page, then 0.048 MAP score is terribly low for 301-700, whereas 0.2076 in mine.

Still your axiomatic retrieval function outperformed the default in many other aspects. So if you would like to check your experimental setting, and if my result is more closer to the real default, then we might discover a further improvement with the axiomatic retrieval function. That is my hope. 

Charlie Zhao


> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Hui Fang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529725 ] 

Hui Fang commented on LUCENE-965:
---------------------------------

Hi Charlie, 

I am sorry for the late reply. I just saw your message. I am not sure why your results are different from mine. But your problem setting is same as mine. Did you use any document preprocessing? 

 

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557139#action_12557139 ] 

Otis Gospodnetic commented on LUCENE-965:
-----------------------------------------

Hui - would it be possible to bring this patch up to date, so it's in sync with Lucene 2.3?

Mike McCandless & Co. have made so many changes to the Lucene index format, I get a feeling that avg. doc. length could also make it into the index format at the segment/index level if this patch is revived.


> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515485 ] 

Doron Cohen edited comment on LUCENE-965 at 7/25/07 8:29 PM:
-------------------------------------------------------------

Thanks for contributing this Hui Fang! Very interesting.
I agree with Grant that we should be able to evaluate this in the context of LUCENE-836 - I hope to finalize it pretty soon. 
I looked into the patch and read the short paper referenced and I have a few comments:

1) Interestingly this too makes use of the average document length, as discussed in http://www.gossamer-threads.com/lists/lucene/java-dev/50537
2) The current patch seem out dated comparing to trunk and also contain many changes that are not part of the proposed improvement. You need to run "svn update" to update with trunk (but do "svn stat -u" beforehand to see what is going to be updated and that there are no conflicts, and bkup your code before that just in case...)
3) The AXSimilarity class itself was is not included in the patch (note that you need to "svn add" the new files in order for "svn diff" to include these new files in the patch.
4) On first reading of the patch it seems that the avarage length is computed at search time for each scored term... right? This is good enough for the evaluation of this Similarity function, but for a running solution a better performance method would be required, like the one Hoss suggested in http://www.gossamer-threads.com/lists/lucene/java-dev/5053



 was:
Thanks for contributing this Hui Fang! Very interesting.
I agree with Grant that we should be able to asses this in the context of LUCENE-836 - I hope to finalize it pretty soon. 
I looked into the patch and read the short paper referenced and I have a few comments:

1) Interestingly this too makes use of the average document length, as discussed in http://www.gossamer-threads.com/lists/lucene/java-dev/50537
2) The current patch seem out dated comparing to trunk and also contain many changes that are not part of the proposed improvement. You need to run "svn update" to update with trunk (but do "svn stat -u" beforehand to see what is going to be updated and that there are no conflicts, and bkup your code before that just in case...)
3) The AXSimilarity class itself was is not included in the patch (note that you need to "svn add" the new files in order for "svn diff" to include these new files in the patch.
4) On first reading of the patch it seems that the avarage length is computed at search time for each scored term... right? This is good enough for the evaluation of this Similarity function, but for a running solution a better performance method would be required, like the one Hoss suggested in http://www.gossamer-threads.com/lists/lucene/java-dev/5053


> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by crspan <cr...@gmail.com>.

Hello Hui:

Thank you for contributing your axiomatic retrieval function to Lucene. 
Can not wait for the test drive :)

Would you please report your setting for your experiment on

Collection Function MAP P@5 P@10 P@20 P@100 NumRR
ROBUST04 Lucene Default 0.048 0.12 0.09 0.08 0.05 21

Since there is a disparity comparing with mine.

num_q 	249
num_ret 	239436
num_rel 	17412
num_rel_ret 	9780
map 	0.2076
gm_ap 	0.1049
R-prec 	0.2551
bpref 	0.2189
recip_rank 	0.5684
ircl_prn.0.00 	0.6288
ircl_prn.0.10 	0.4459
ircl_prn.0.20 	0.3562
ircl_prn.0.30 	0.2864
ircl_prn.0.40 	0.2289
ircl_prn.0.50 	0.1925
ircl_prn.0.60 	0.145
ircl_prn.0.70 	0.1062
ircl_prn.0.80 	0.0702
ircl_prn.0.90 	0.0461
ircl_prn.1.00 	0.0261
P5 	0.3944
P10 	0.3598
P15 	0.3307
P20 	0.307
P30 	0.2657
P100 	0.1618
P200 	0.1117
P500 	0.0635
P1000 	0.0393



Before I go further, let us make sure we are in the same page.

Here is my setting:

Data: TREC Disk 4 & 5; 528,155 documents; 1,904 MB of text.
Query Number: TREC Query Number 301-700
Query Field: <title> only
IR Engine: Lucene 2.0 (need double check, but close:)
Note: default Lucene similarity function, using title words only.

If we are in the same page, then 0.048 MAP score is terribly low for 
301-700, whereas 0.2076 in mine.

Still your axiomatic retrieval function outperformed the default in many 
other aspects. So if you would like to check your experimental setting, 
and if my result is more closer to the real default, then we might 
discover a further improvement with the axiomatic retrieval function. 
That is my hope.

Charlie Zhao


>
>   
>> Implement a state-of-the-art retrieval function in Lucene
>> ---------------------------------------------------------
>>
>>                 Key: LUCENE-965
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>>             Project: Lucene - Java
>>          Issue Type: Improvement
>>          Components: Search
>>    Affects Versions: 2.2
>>            Reporter: Hui Fang
>>         Attachments: axiomaticFunction.patch
>>
>>
>> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
>> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
>> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Hui Fang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515514 ] 

Hui Fang commented on LUCENE-965:
---------------------------------

Hi Grant and Doron, 

Thank you very much for your comments! They are very useful. I agree that it would be interesting to evaluate this in the context of Lucene-836, which is a very nice idea. Actually, my advisor and I also discussed that we could put some evaluation scripts in Lucene so that others could easily evaluate the retrieval performance. Hope that Lucene-836 would be finalized soon, and please let me know if there is anything I could help. 

Regarding to the speed, the axiomatic retrieval function should have the same computatlonal complexity as the default function if we could compute the average document length at the indexing time instead of search time.  As Doron pointed out, my current implementation is not optimal, I will fix this problem and other svn related problems as soon as possible, and resubmit a new patch. 

Thanks,
-Hui

 

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516507 ] 

Doug Cutting commented on LUCENE-965:
-------------------------------------

> Is there a way to plug in a patch into my local source repository, so I can diff with my favorite diff tool?

patch -p 0 < foo.patch


> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515482 ] 

Grant Ingersoll commented on LUCENE-965:
----------------------------------------

What do people make of this?  Interesting claims.  I haven't looked at the patch yet or read up on the Axiomatic retrieval model, but the precision numbers in the report are impressive.  I think it dovetails nicely with Doron and Chris' discussions on retrieval performance and making better efforts to gauge Lucene's retrieval effectiveness.  These numbers are for TREC and that doesn't always correlate to the real world, but still, not to be discounted, either.

I think it would be cool to see a couple things out of this (at least):
1. contrib/benchmark algorithms to be applied for before and after, including LUCENE-836.  This would give everyone a way of easily evaluating (assuming they have TREC data).  I would wait for 836 to be committed, though, as it is not final yet.
2. Search speed numbers comparing the two approaches.  That is if it is significantly slower, than it probably isn't going to be the default way of doing things

My gut reaction would be, if everything checks out of course, is to see how to factor it in as a separate querying mechanism, if possible like the Spans functionality, to give people the option of using this and if the claims hold up in the wild and feedback is positive, then we could look to making it the default approach.  Not sure how feasible this is, though

In the meantime, looks like I've got some reading to do...

Cheers,
Grant

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by Grant Ingersoll <gs...@apache.org>.

I don't think the original authors have followed up on this patch at  
all since first posting.

On Nov 27, 2008, at 6:44 AM, Ian Holsman (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651332 
> #action_12651332 ]
>
> Ian Holsman commented on LUCENE-965:
> ------------------------------------
>
> It's a bit late over here, but when I try to apply the patch it  
> doesn't seem to have the AXSimilarity class in it.
> is there a file missing here, or should i not be looking at applying  
> patches late at night?
>
>> Implement a state-of-the-art retrieval function in Lucene
>> ---------------------------------------------------------
>>
>>                Key: LUCENE-965
>>                URL: https://issues.apache.org/jira/browse/LUCENE-965
>>            Project: Lucene - Java
>>         Issue Type: Improvement
>>         Components: Search
>>   Affects Versions: 2.2
>>           Reporter: Hui Fang
>>            Fix For: 3.0
>>
>>        Attachments: axiomaticFunction.patch
>>
>>
>> We implemented the axiomatic retrieval function, which is a state- 
>> of-the-art retrieval function, to
>> replace the default similarity function in Lucene. We compared the  
>> performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf 
>> .
>> The report shows that the performance of the axiomatic retrieval  
>> function is much better than the default function. The axiomatic  
>> retrieval function is able to find more relevant documents and  
>> users can see more relevant documents in the top-ranked documents.  
>> Incorporating such a state-of-the-art retrieval function could  
>> improve the search performance of all the applications which were  
>> built upon Lucene.
>> Most changes related to the implementation are made in  
>> AXSimilarity, TermScorer and TermQuery.java.  However, many test  
>> cases are hand coded to test whether the implementation of the  
>> default function is correct. Thus, I also made the modification to  
>> many test files to make the new retrieval function pass those  
>> cases. In fact, we found that some old test cases are not  
>> reasonable. For example, in the testQueries02 of TestBoolean2.java,
>> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3"  
>> and "w1 w3 xx w2 yy w3".
>> The second document should be more relevant than the first one,  
>> because it has more
>> occurrences of the query term "w3". But the original test case  
>> would require us to rank
>> the first document higher than the second one, which is not  
>> reasonable.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Ian Holsman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651332#action_12651332 ] 

Ian Holsman commented on LUCENE-965:
------------------------------------

It's a bit late over here, but when I try to apply the patch it doesn't seem to have the AXSimilarity class in it.
is there a file missing here, or should i not be looking at applying patches late at night?

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>             Fix For: 3.0
>
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515773 ] 

Yonik Seeley commented on LUCENE-965:
-------------------------------------

It does seem like calculating the average field length at index time should be relatively cheap.
I haven't seen the Similarity implementation, but the axiomatic TermScorer.score() will be somewhat slower than Lucene's due to the necessary division (all but one can be turned into a multiply I think).



> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Charlie Zhao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516185 ] 

Charlie Zhao commented on LUCENE-965:
-------------------------------------


Regarding the approach to compute avgDL, this patch goes like this: 

+    float avgDL=0.0f;
+    for (int i=0; i<norms.length;i++) {
+        avgDL += normDecoder[norms[i] & 0xFF];
+    }
+    avgDL /= norms.length * 1.0f;

But may I suggest the alternative? 

      float CL = 0.0f;
      float avgDL = 0.0f;
      float aDL = 0.0f; 
      for (int i=0; i<norms.length;i++) {
        aDL = 1.0f / normDecoder[norms[i] & 0xFF] ; 
        aDL *= aDL;
        CL += aDL;
      }
      avgDL = CL / norms.length;    

Let us see a toy example:

2 docs in index

	|D|	avgD		|D|/avgD	norm	avgNorm 	norm/avgNorm		
D1	4	10		2/5		1/2	3/8		4/3
D2	16	10		8/5		1/4	3/8		2/3

norm/avgNorm is what we got from the patch code and D1>D2

|D|/avgD is what we got from the suggested alternative code and D1 < D2

They have totally flipped the relationship between D1 and D2. 

My impression of the Axiomatic Retrieval Function is: it still tries to penalize longer doc. So maybe the alternative code is what we need? 

By the same token, |D| != Similarity.decodeNorm(fieldNorms[doc]). 

Note: since we are recovering from the norm, so avgDL and DL != their original absolute value. But they suffice for the scoring purpose. 

Based on Doug's previous comment, I totally agree that avgDL should be pre-computed and cached in the searcher before where the rubber meets the road. And the cost might be invisible if we warm up the searcher first. Thanks for explaining. 

Not sure where Doron implemented 1 / sqrt((1 - Slope) * Pivot + (Slope) * Doclen). Since LUCENE-836 looks will be committed soon. I am really excited to see which similarity function will prevail in this era.  

BTW, anyone would like to share how to read Lucene patches more efficiently? I mean I had hard time to make sense of those +s and -s independently from their source files. Is there a way to plug in a patch into my local source repository, so I can diff with my favorite diff tool? Thanks in advance. 


> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Charlie Zhao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515833 ] 

Charlie Zhao commented on LUCENE-965:
-------------------------------------



Document Length and Average Document Length are sort of speed bottlenecks of  Lucene's implementation of some IR models, like Axiomatic Retrieval Function we just saw and one Language Model I have extended in Lucene. I said speed, instead of performance. Because Lucene's performance measures (in the sense of recall and precision) are relatively low comparing with other IR models with my experimental results. And since early Lucene, we never updated the kernel of similarity measure algorithm. Do general users value (recall+precision) more than (speed)? 

How to conveniently store and retrieve "field length", "document length", "average document length", etc.? Can they be the payload data at document level and index level? So we may say bye to their corresponding overhead during query time? 

I used to leverage from TermFreqVector's getTermFrequencies() to obtain the field length. (size() only return the unique terms)  But shall I just reverse that field's norm value back to its length as (1/norm)^2? Which might be faster. Can someone confirm this?

BTW, I need help to understand the claim of "a small constant factor to the cost of reading them." in Doug's comment. Average norm does not give us the average field length. We need to recover the individual field length to get the average field length, which involve a great deal of floating point operations there. Did I miss something?

Can we store the "document length" (with multiple fields) and "average document length" as the payload data at document level and index level respectively? The current payload is designed at term level, is it right? If we want to store something at document and index level, do we necessary change the Lucene file format? 





> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515485 ] 

Doron Cohen commented on LUCENE-965:
------------------------------------

Thanks for contributing this Hui Fang! Very interesting.
I agree with Grant that we should be able to asses this in the context of LUCENE-836 - I hope to finalize it pretty soon. 
I looked into the patch and read the short paper referenced and I have a few comments:

1) Interestingly this too makes use of the average document length, as discussed in http://www.gossamer-threads.com/lists/lucene/java-dev/50537
2) The current patch seem out dated comparing to trunk and also contain many changes that are not part of the proposed improvement. You need to run "svn update" to update with trunk (but do "svn stat -u" beforehand to see what is going to be updated and that there are no conflicts, and bkup your code before that just in case...)
3) The AXSimilarity class itself was is not included in the patch (note that you need to "svn add" the new files in order for "svn diff" to include these new files in the patch.
4) On first reading of the patch it seems that the avarage length is computed at search time for each scored term... right? This is good enough for the evaluation of this Similarity function, but for a running solution a better performance method would be required, like the one Hoss suggested in http://www.gossamer-threads.com/lists/lucene/java-dev/5053


> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Hui Fang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745995#action_12745995 ] 

Hui Fang commented on LUCENE-965:
---------------------------------

Jason,  the problem has been fixed.  Please try again.  Thanks. 



> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>             Fix For: 3.0
>
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630820#action_12630820 ] 

Mark Miller commented on LUCENE-965:
------------------------------------

Lets not forget about trying to get avg doc length in by 3.0 -if it can be done with little/to no impact on non users of it, would be really cool to have.

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515854 ] 

Michael Busch commented on LUCENE-965:
--------------------------------------

> I guess I would not be in favor of a special term, I would rather see it integrated 
> into the file format somehow. Special terms get deleted, misused, etc. 

Well yes, I would also prefer to have real per-doc payloads in the file format, but
until we have that we can use this workaround to try things out, as the performance
should be comparable to real per-doc payloads.

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Hui Fang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui Fang updated LUCENE-965:
----------------------------

    Attachment: axiomaticFunction.patch

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515850 ] 

Doug Cutting commented on LUCENE-965:
-------------------------------------

> Did I miss something?

What I meant is that the loops added by this patch to compute average document length per query term could be more efficiently computed once per field in a searcher.  They could thus be cached in, e.g., a WeakHashMap<Searcher,Map<String,float[]>>.

The cost of computing these is proportional to the size of the norms, which means that it is proportional to the cost of reading the norms.  Computing them on demand when a searcher is opened would not be as fast as pre-computing them, but it might not prohibitively slow either, and would be simple to implement without other changes to Lucene.

By "average norm" I guess I really meant "easily computable from norms".  This may not always be possible, since, e.g., with boosting, document lengths may not be recoverable from the norms.  But, in many cases, it might suffice.

Does that help?

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516547 ] 

Doron Cohen commented on LUCENE-965:
------------------------------------

> Is there a way to plug in a patch into my local source repository, so I can diff with my favorite diff tool?
: patch -p 0 < foo.patch  

Try with --dry-run first.
Another convenient way in case you are using Eclipse is the Subclipse plugin that lets you visually diff patches just before applying them.

> But may I suggest the alternative? 

I think you have a valid point here. I too don't understand the proposed "Axiomatic Retrieval Function" (ARF) in this regard: in Lucene, the norm value stored for a document (assuming all boosts are 1) is
    norm(D) = 1 / sqrt(numTerms(D))
This value is ready to use at scoring time, multiplying it with  
    tf(t in d)  -   idf(t)^^2   
as described in http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html

Now, the ARF paper in http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf describes Lucene scoring using |D| in place of norm(D) above, and describes ARF scoring using |D| again, the same as it seems to be implemented in this patch e.g. in TermScorer. However, the paper defines |D| as "the length of D". I find this confusing. Usually "|D|" really means the number of words in a document, and "avgdl" would mean the average of all |D|'s in the collection (see for instance "Okapi BM25" in Wikipedia). 

Now, your proposed change is something I can understand - it first translates back norm(D) into Length(D) (ignoring boosts), and only then averaging it. 

In any case - I mean if either this is fixed or I am wrong and an explanation shows why no fix is needed - I have to admit I still don't understand the logic behind ARF, intuitively, why would it be better? Guess provable search quality results can help in persuading...  (LUCENE-836 is resolved btw).

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-965:
---------------------------------

    Fix Version/s:     (was: 3.0)
                   3.1

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>             Fix For: 3.1
>
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515840 ] 

Michael Busch commented on LUCENE-965:
--------------------------------------

> Can we store the "document length" (with multiple fields) and "average document length" 
> as the payload data at document level and index level respectively? The current payload 
> is designed at term level, is it right? If we want to store something at document and 
> index level, do we necessary change the Lucene file format? 

You are right, currently we can only store payloads per term occurrence, not at the doc
level. However, it is possible to simply add a special term to every document that has
only one occurrence with a payload, then you have one payload per doc.

Coincidentally I am currently testing how search performance would suffer if we stored
the norms as payloads in the posting lists. At search time we would then not buffer the
norms but read them on demand from the prx file. This is probably somewhat slower than 
buffering the norms, but has a lot of advantages, such as much simpler code and less 
memory consumption by the IndexReader. Since all norms are then stored in a single 
posting lists I'm hoping that the FS cache will help that the search performance won't
suffer too much. And multi-level skipping should help too. Well let's see, I'm currently
building an index with norms as payloads, I should have some numbers tonight or tomorrow.

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated LUCENE-965:
------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
    Fix Version/s: 3.0

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>             Fix For: 3.0
>
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.  However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank 
> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org