Posted to openrelevance-dev@lucene.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/02/08 15:15:27 UTC

[jira] Created: (ORP-6) Add TREC9 filtering (OHSUMED) collection

Add TREC9 filtering (OHSUMED) collection
----------------------------------------

                 Key: ORP-6
                 URL: https://issues.apache.org/jira/browse/ORP-6
             Project: Open Relevance Project 
          Issue Type: New Feature
          Components: Collections
            Reporter: Andrzej Bialecki 




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ORP-6) Add TREC9 filtering (OHSUMED) collection

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830984#action_12830984 ] 

Robert Muir commented on ORP-6:
-------------------------------

>>  For calculation of metrics that depend on position (such as NDCG) this needs to be taken into account, e.g. by first sorting the qrels by relevance and calculating an Ideal DCG@N, where N is the number of available qrels. 

Andrzej, I looked at a patch to trec_eval that adds NDCG support, and it appears to do this sort itself: http://cio.nist.gov/esd/emaildir/lists/ireval/msg00037.html
I guess the latest version does not support this metric. Are people using this patch, or is there some other NDCG calculator that does not do this sort?

> Add TREC9 filtering (OHSUMED) collection
> ----------------------------------------
>
>                 Key: ORP-6
>                 URL: https://issues.apache.org/jira/browse/ORP-6
>             Project: Open Relevance Project 
>          Issue Type: New Feature
>          Components: Collections
>            Reporter: Andrzej Bialecki 
>         Attachments: ohsumed.patch
>
>




[jira] Commented: (ORP-6) Add TREC9 filtering (OHSUMED) collection

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830986#action_12830986 ] 

Andrzej Bialecki  commented on ORP-6:
-------------------------------------

bq. OK, I think the best way to handle this is to instead make it easier to run T, T+D, T+D+N, etc queries from the benchmark package.

That would be cool - yes, it's a Lucene benchmark issue.

bq. I thought DCG etc were only based on the '2' versus '1' value in the qrels? I am only vaguely familiar with these so I could be wrong?

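Discounted Cumulative Gain (http://en.wikipedia.org/wiki/Discounted_Cumulative_Gain), unlike plain Cumulative Gain, discounts the importance of a result by its position in the list of results (its rank), so it depends on the ordering and not only on the '2' versus '1' values.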
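
To illustrate the difference, here is a minimal sketch in Python. It assumes the common log2(rank + 1) discount; the thread does not say which variant is meant, and the function names are made up for this example.

```python
import math


def cumulative_gain(gains):
    """Plain CG: the sum of relevance grades, independent of order."""
    return sum(gains)


def discounted_cumulative_gain(gains):
    """DCG: each grade is divided by log2(rank + 1), with ranks starting at 1.

    The log2(rank + 1) discount is an assumption; other variants exist.
    """
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


# The same grades in two different orders.
best_first = [2, 1, 0]
worst_first = [0, 1, 2]

# CG cannot tell the two rankings apart, DCG can.
same_cg = cumulative_gain(best_first) == cumulative_gain(worst_first)
dcg_prefers_best_first = (
    discounted_cumulative_gain(best_first) > discounted_cumulative_gain(worst_first)
)
```

With these inputs both rankings have the same CG, while DCG is higher when the grade-2 document comes first.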

bq. I guess the latest version does not support this metric. Are people using this patch, or is there some other NDCG calculator that does not do this sort?

No, I stumbled upon this issue when implementing NDCG myself for another project.

Ok, I'll add these remarks and commit. Thanks!



[jira] Commented: (ORP-6) Add TREC9 filtering (OHSUMED) collection

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830977#action_12830977 ] 

Robert Muir commented on ORP-6:
-------------------------------

>> I created separate corpora and qrels for the test and train parts of the original collection. 

I am not familiar with this collection, but from your README and the original file naming this appears to be the right thing to do.

>> the MeSH and OHSU topics are very different - e.g. from my experience MeSH topics converted to Lucene queries must include the description, because quite often the most relevant docs don't contain the MeSH term itself. This however makes for very long queries ... 

OK, I think the best way to handle this is to instead make it easier to run T, T+D, T+D+N, etc queries from the benchmark package. I'll open an issue with an initial patch for you to look over (but I don't think this is an ORP problem, just a problem that the benchmark package is really only set up to run Title queries right now).
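
As a rough sketch of what T, T+D and T+D+N queries mean (title, title plus description, title plus description plus narrative, in the usual TREC sense), the Python below joins the selected topic fields into one query string. The topic dictionary layout, the field names and the `build_query` helper are all hypothetical; this is not the Lucene benchmark package API.

```python
# Hypothetical mapping from the TREC shorthand to topic field names.
FIELD_NAMES = {"T": "title", "D": "description", "N": "narrative"}


def build_query(topic, spec):
    """Build a query string from the topic fields named in spec, e.g. "T+D".

    Fields that are missing or empty in the topic are skipped.
    """
    parts = []
    for code in spec.split("+"):
        text = topic.get(FIELD_NAMES[code], "").strip()
        if text:
            parts.append(text)
    return " ".join(parts)


# A made-up topic with no narrative field.
topic = {"title": "heart attack", "description": "treatment options"}

title_only = build_query(topic, "T")
title_and_description = build_query(topic, "T+D")
```

The T+D form is the one that gets long for topics whose description carries most of the useful terms.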

>> AFAIU the definition of the filtering track is that qrels are NOT ranked, they just list relevant docs in random order. For calculation of metrics that depend on position (such as NDCG) this needs to be taken into account, e.g. by first sorting the qrels by relevance and calculating an Ideal DCG@N, where N is the number of available qrels. 

I thought DCG etc were only based on the '2' versus '1' value in the qrels? I am only vaguely familiar with these so I could be wrong?



[jira] Commented: (ORP-6) Add TREC9 filtering (OHSUMED) collection

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830967#action_12830967 ] 

Andrzej Bialecki  commented on ORP-6:
-------------------------------------

Sure, why not. But there are some points that I'm not sure about yet:

* I created separate corpora and qrels for the test and train parts of the original collection.

* the MeSH and OHSU topics are very different - e.g. from my experience MeSH topics converted to Lucene queries must include the description, because quite often the most relevant docs don't contain the MeSH term itself. This however makes for very long queries ...

* AFAIU the definition of the filtering track is that qrels are NOT ranked, they just list relevant docs in random order. For calculation of metrics that depend on position (such as NDCG) this needs to be taken into account, e.g. by first sorting the qrels by relevance and calculating an Ideal DCG@N, where N is the number of available qrels.
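
The last point can be sketched in Python as follows. This is only an illustration under stated assumptions: qrels are held in a plain dict of doc id to relevance grade, the discount is log2(rank + 1), and the `dcg` and `ndcg` helpers are made up here rather than taken from trec_eval or the patch.

```python
import math


def dcg(gains):
    """DCG with a log2(rank + 1) discount (an assumed variant), ranks from 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_doc_ids, qrels):
    """NDCG@N where N is the number of available qrels.

    qrels maps doc id -> relevance grade and carries no ordering, so the
    ideal ranking is obtained by sorting the grades in descending order.
    """
    n = len(qrels)
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True))
    if ideal_dcg == 0:
        return 0.0
    # Documents without a qrel entry are treated as non-relevant (grade 0).
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:n]]
    return dcg(gains) / ideal_dcg


# Made-up qrels, listed in arbitrary order as in the filtering track.
qrels = {"a": 1, "b": 2, "c": 1}

perfect = ndcg(["b", "a", "c"], qrels)
imperfect = ndcg(["a", "x", "b"], qrels)
```

Without the sort, the ideal DCG would depend on the arbitrary order in which the qrels file lists the documents, and the normalized score could exceed 1.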

I could add these remarks to the README.



[jira] Commented: (ORP-6) Add TREC9 filtering (OHSUMED) collection

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830994#action_12830994 ] 

Robert Muir commented on ORP-6:
-------------------------------

Thanks, Andrzej, for your work here. I opened LUCENE-2254 for the Lucene benchmark issue.



[jira] Commented: (ORP-6) Add TREC9 filtering (OHSUMED) collection

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830958#action_12830958 ] 

Robert Muir commented on ORP-6:
-------------------------------

+1 (built and ran evaluation with training corpus/qrels)

Andrzej, wanna commit this?




[jira] Updated: (ORP-6) Add TREC9 filtering (OHSUMED) collection

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated ORP-6:
--------------------------------

    Attachment: ohsumed.patch

This patch adds support for creating collections from the TREC9 / OHSUMED corpus, queries and qrels.
