Posted to java-user@lucene.apache.org by Yakob <ja...@opensuse-id.org> on 2010/11/29 14:01:11 UTC

precision and recall in lucene

hello all
I was wondering, if I want to measure precision and recall in lucene
then what's the best way for me to do it? is there any sample source
code that I can use?

thanks though
-- 
http://jacobian.web.id

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: precision and recall in lucene

Posted by Erick Erickson <er...@gmail.com>.
Well, I guess I can answer your original question with "no". There's
no Lucene method that will give you these because they aren't
defined. If you can answer the question "given a corpus and a set
of queries and the correct ordering of the relevant documents, how
close does Lucene come to that ordering?", then you could calculate
whether all the docs that should have been returned were found (recall)
and whether the documents returned contained only the documents
that should have been returned (precision).

But a lot of effort in Solr/Lucene tuning is tweaking returned results
to make precision and recall "better", where "better" is understood
relative to a particular problem space and isn't well defined in
the abstract (and perhaps can't be).
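For concreteness, the set-based definitions above can be sketched in a few lines of plain Java (no Lucene dependency; the class name and document IDs are made up for illustration):

```java
import java.util.Set;

public class PrecisionRecallDemo {
    // Precision: fraction of the retrieved documents that are relevant.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / retrieved.size();
    }

    // Recall: fraction of the relevant documents that were retrieved.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = Set.of("d1", "d2", "d3", "d4");
        Set<String> relevant  = Set.of("d1", "d3", "d5");
        // 2 of 4 retrieved are relevant -> precision 0.50
        // 2 of 3 relevant were retrieved -> recall 0.67
        System.out.printf("precision=%.2f recall=%.2f%n",
                precision(retrieved, relevant),
                recall(retrieved, relevant));
    }
}
```

Note these numbers only exist relative to a judged set of relevant documents, which is exactly the part Lucene cannot supply for you.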

Best
Erick

On Mon, Nov 29, 2010 at 8:40 AM, Yakob <ja...@opensuse-id.org> wrote:

> On 11/29/10, Erick Erickson <er...@gmail.com> wrote:
> > Define precision. Define recall. Define measure <G>....
> >
> > Sorry to give in to my impulses, but this question is so broad it's
> > unanswerable. Try looking at the Text REtrieval Conference for instance.
> > Lots of very bright people spend significant amounts of their careers
> > trying to just define what these mean. Much less how to measure them.
> >
> > And what "good" precision and recall are varies
> > with the search space. And the users. An academic researcher may
> > be willing to spend days finding the one paper out there that speaks
> > to a very specific question. Your average web user won't click past
> > the 2nd page, maybe not the 1st.
> >
> > So perhaps you can tell us what it is you want these measures
> > for and maybe we can come up with some answers that are actually
> > helpful...
> >
> > Best
> > Erick
>
> well, when I read the ebook "Lucene in Action" I came across this
> sentence:
>
> "Searching is the process of looking up words in an index to find
> documents where they appear. The quality of a search is typically
> described using precision and recall metrics. Recall measures how well
> the search system finds relevant documents, whereas precision measures
> how well the system filters out the irrelevant documents."
>
> I am just wondering how to measure the precision and recall metrics
> in Lucene. I mean, I just want to do an analysis of precision and
> recall in my thesis, which happens to use Lucene as the framework. :-)
>
>
>
> --
> http://jacobian.web.id
>

Re: precision and recall in lucene

Posted by Yakob <ja...@opensuse-id.org>.
On 11/29/10, Erick Erickson <er...@gmail.com> wrote:
> Define precision. Define recall. Define measure <G>....
>
> Sorry to give in to my impulses, but this question is so broad it's
> unanswerable. Try looking at the Text REtrieval Conference for instance.
> Lots of very bright people spend significant amounts of their careers
> trying to just define what these mean. Much less how to measure them.
>
> And what "good" precision and recall are varies
> with the search space. And the users. An academic researcher may
> be willing to spend days finding the one paper out there that speaks
> to a very specific question. Your average web user won't click past
> the 2nd page, maybe not the 1st.
>
> So perhaps you can tell us what it is you want these measures
> for and maybe we can come up with some answers that are actually
> helpful...
>
> Best
> Erick

well, when I read the ebook "Lucene in Action" I came across this sentence:

"Searching is the process of looking up words in an index to find
documents where they appear. The quality of a search is typically
described using precision and recall metrics. Recall measures how well
the search system finds relevant documents, whereas precision measures
how well the system filters out the irrelevant documents."

I am just wondering how to measure the precision and recall metrics
in Lucene. I mean, I just want to do an analysis of precision and
recall in my thesis, which happens to use Lucene as the framework. :-)



-- 
http://jacobian.web.id



Re: precision and recall in lucene

Posted by Erick Erickson <er...@gmail.com>.
Define precision. Define recall. Define measure <G>....

Sorry to give in to my impulses, but this question is so broad it's
unanswerable. Try looking at the Text REtrieval Conference for instance.
Lots of very bright people spend significant amounts of their careers
trying to just define what these mean. Much less how to measure them.

And what "good" precision and recall are varies
with the search space. And the users. An academic researcher may
be willing to spend days finding the one paper out there that speaks
to a very specific question. Your average web user won't click past
the 2nd page, maybe not the 1st.

So perhaps you can tell us what it is you want these measures
for and maybe we can come up with some answers that are actually
helpful...

Best
Erick

On Mon, Nov 29, 2010 at 8:01 AM, Yakob <ja...@opensuse-id.org> wrote:

> hello all
> I was wondering, if I want to measure precision and recall in lucene
> then what's the best way for me to do it? is there any sample source
> code that I can use?
>
> thanks though
> --
> http://jacobian.web.id
>

Re: precision and recall in lucene

Posted by Yakob <ja...@opensuse-id.org>.
On 12/1/10, Robert Muir <rc...@gmail.com> wrote:

>
> you fill the topics files with list of queries, like the lia2 example
> that has a single query for "apache source":
>
> <top>
> <num> Number: 0
> <title> apache source
> <desc> Description:
> <narr> Narrative:
> </top>
>
> then you populate the qrels file with the "answers" for your document
> collection:
>
> #       qnum   0   doc-name     is-relevant
>
> 0        0       apache1.0.txt           1
> 0        0       apache1.1.txt           1
> 0        0       apache2.0.txt           1
>
> this says that these 3 documents are relevant results for the query
> "apache source"
>

OMG, you are really helpful. I just did it. I really think we should be
friends on Facebook though. hehe...

thank you. :-)

-- 
http://jacobian.web.id



Re: precision and recall in lucene

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Dec 1, 2010 at 7:25 AM, Yakob <ja...@opensuse-id.org> wrote:
> can you give me an example of how to populate the topics file and
> qrels file other than those in the LIA2 sample code? I still don't
> understand how these two text files work anyway. :-)
>
> let me get this straight. I need to fill the topics file with any queries
> that I want and the qrels file with judgements. but what is the meaning
> of a judgement in this case?
>

you fill the topics files with list of queries, like the lia2 example
that has a single query for "apache source":

<top>
<num> Number: 0
<title> apache source
<desc> Description:
<narr> Narrative:
</top>

then you populate the qrels file with the "answers" for your document
collection:

#       qnum   0   doc-name     is-relevant

0        0       apache1.0.txt           1
0        0       apache1.1.txt           1
0        0       apache2.0.txt           1

this says that these 3 documents are relevant results for the query
"apache source"
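To make the whitespace-separated qrels layout above concrete, here is a small plain-Java sketch that parses such lines into a map from query number to relevant document names (the class name is made up; Lucene's own TrecJudge does this parsing for you):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class QrelsParser {
    // Parses TREC-style qrels rows: "<qnum> <iteration> <docName> <isRelevant>"
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> judgements = new HashMap<>();
        for (String line : lines) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip comments
            String[] f = line.split("\\s+");
            if (f.length < 4) continue;
            if (Integer.parseInt(f[3]) > 0) { // keep only positive judgements
                judgements.computeIfAbsent(f[0], k -> new HashSet<>()).add(f[2]);
            }
        }
        return judgements;
    }

    public static void main(String[] args) {
        List<String> qrels = List.of(
            "#       qnum   0   doc-name     is-relevant",
            "0        0       apache1.0.txt           1",
            "0        0       apache1.1.txt           1",
            "0        0       apache2.0.txt           1");
        // query "0" has three relevant documents
        System.out.println(parse(qrels).get("0").size());
    }
}
```

The important point is the join key: the document names in the qrels must match exactly what the benchmark reports as the doc name (the docNameField), or every judgement will "match no query".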



Re: precision and recall in lucene

Posted by Yakob <ja...@opensuse-id.org>.
On 12/1/10, Robert Muir <rc...@gmail.com> wrote:

>
> well you can't use those files with your own document collection.
> you need to populate the topics file with queries that you care about
> measuring.
> then you need to populate the qrels file with judgements for each
> query,  *for your collection*. you are saying this set of documents is
> relevant as a search result to this query.
>

can you give me an example of how to populate the topics file and
qrels file other than those in the LIA2 sample code? I still don't
understand how these two text files work anyway. :-)

let me get this straight. I need to fill the topics file with any queries
that I want and the qrels file with judgements. but what is the meaning
of a judgement in this case?

-- 
http://jacobian.web.id



Re: precision and recall in lucene

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Dec 1, 2010 at 5:53 AM, Yakob <ja...@opensuse-id.org> wrote:
>
> well yes, your information is really helpful. I did find the topics and
> qrels files that come in /src/lia/benchmark in the LIA2 sample code,
> and the result did change slightly, but the precision and recall values
> are still zero. I also used QueryDriver as you suggested, and the
> result is the same: precision and recall are still zero. I really need
> the precision and recall values to be anything but zero.
>

well you can't use those files with your own document collection.
you need to populate the topics file with queries that you care about measuring.
then you need to populate the qrels file with judgements for each
query,  *for your collection*. you are saying this set of documents is
relevant as a search result to this query.



Re: precision and recall in lucene

Posted by Yakob <ja...@opensuse-id.org>.
On 11/30/10, Robert Muir <rc...@gmail.com> wrote:
> On Tue, Nov 30, 2010 at 10:46 AM, Yakob <ja...@opensuse-id.org> wrote:
>
>> can you tell me what went wrong? what is the difference between
>> topicsFile and qrelsFile anyway?
>>
>
> well it's hard to tell what you are supplying as topics and qrels.
> have a look at /src/lia/benchmark in the LIA2 sample code: it has an
> example topic and 3 rows in qrels for it.
>
> basically, the topicsFile contains the "queries", and the qrelsFile
> contains judgements as to which documents are relevant.
>
> P.S.: once you have an index, a topics, and a qrels file, you can just
> use org.apache.lucene.benchmark.quality.trec.QueryDriver (it has a
> main method)
>

well yes, your information is really helpful. I did find the topics and
qrels files that come in /src/lia/benchmark in the LIA2 sample code,
and the result did change slightly, but the precision and recall values
are still zero. I also used QueryDriver as you suggested, and the
result is the same: precision and recall are still zero. I really need
the precision and recall values to be anything but zero.

I also put this question on Stack Overflow, so you can check this link
if you want to know more about my problem.

http://t.co/t0hat0T

so any further advice would be helpful though :-)
-- 
http://jacobian.web.id



Re: precision and recall in lucene

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Nov 30, 2010 at 10:46 AM, Yakob <ja...@opensuse-id.org> wrote:

> can you tell me what went wrong? what is the difference between
> topicsFile and qrelsFile anyway?
>

well it's hard to tell what you are supplying as topics and qrels.
have a look at /src/lia/benchmark in the LIA2 sample code: it has an
example topic and 3 rows in qrels for it.

basically, the topicsFile contains the "queries", and the qrelsFile
contains judgements as to which documents are relevant.

P.S.: once you have an index, a topics, and a qrels file, you can just
use org.apache.lucene.benchmark.quality.trec.QueryDriver (it has a
main method)



Re: precision and recall in lucene

Posted by Yakob <ja...@opensuse-id.org>.
On 11/30/10, Robert Muir <rc...@gmail.com> wrote:

>
> Have a look at contrib/benchmark under the
> org.apache.lucene.benchmark.quality package.
> There is code (for example
> org.apache.lucene.benchmark.quality.trec.QueryDriver) that can run an
> experiment and output what you need for trec_eval.exe
> I think there is some additional documentation on how to use this in
> Lucene in Action 2.
>


yes, you're right. I just realized it. I did find the sample source code
for precision and recall in that ebook, shown here:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.PrintWriter;

import org.apache.lucene.benchmark.quality.Judge;
import org.apache.lucene.benchmark.quality.QualityBenchmark;
import org.apache.lucene.benchmark.quality.QualityQuery;
import org.apache.lucene.benchmark.quality.QualityQueryParser;
import org.apache.lucene.benchmark.quality.QualityStats;
import org.apache.lucene.benchmark.quality.trec.TrecJudge;
import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PrecisionRecall {
    public static void main(String[] args) throws Throwable {
        File topicsFile = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/adi/50.txt");
        File qrelsFile = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/adi/51.txt");
        Directory dir = FSDirectory.open(new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/adi"));
        Searcher searcher = new IndexSearcher(dir, true);

        String docNameField = "filename";
        PrintWriter logger = new PrintWriter(System.out, true);

        // read the TREC-format topics (the queries to run)
        TrecTopicsReader qReader = new TrecTopicsReader();
        QualityQuery qqs[] = qReader.readQueries(new BufferedReader(
                new FileReader(topicsFile)));

        // read the qrels (the relevance judgements for those queries)
        Judge judge = new TrecJudge(new BufferedReader(
                new FileReader(qrelsFile)));
        judge.validateData(qqs, logger);

        // parse each topic's <title> into a query against the "contents" field
        QualityQueryParser qqParser = new SimpleQQParser("title", "contents");

        QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser,
                searcher, docNameField);
        SubmissionReport submitLog = null;
        QualityStats stats[] = qrun.execute(judge, submitLog, logger);

        QualityStats avg = QualityStats.average(stats);
        avg.log("SUMMARY", 2, logger, "  ");
        dir.close();
    }
}

or in a more friendly format you can see it in this link

http://pastebin.ca/2006780

but when I run that code in Lucene, the precision and recall I get
are zero. here is the result:

WARNING: 5 judgments match no query! -
   potentialities
   an
   on
   the
   and


SUMMARY
 Search Seconds:         0.000
 DocName Seconds:        0.000
 Num Points:             0.000
 Num Good Points:        0.000
 Max Good Points:        0.000
 Average Precision:      0.000
 MRR:                    0.000
 Recall:                 0.000


can you tell me what went wrong? what is the difference between
topicsFile and qrelsFile anyway?

thanks though. :-)
-- 
http://jacobian.web.id



Re: precision and recall in lucene

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 29, 2010 at 8:01 AM, Yakob <ja...@opensuse-id.org> wrote:
> hello all
> I was wondering, if I want to measure precision and recall in lucene
> then what's the best way for me to do it? is there any sample source
> code that I can use?
>

Have a look at contrib/benchmark under the
org.apache.lucene.benchmark.quality package.
There is code (for example
org.apache.lucene.benchmark.quality.trec.QueryDriver) that can run an
experiment and output what you need for trec_eval.exe
I think there is some additional documentation on how to use this in
Lucene in Action 2.
