Posted to dev@lucene.apache.org by "Gergő Törcsvári (JIRA)" <ji...@apache.org> on 2014/08/04 17:36:11 UTC

[jira] [Commented] (LUCENE-5699) Lucene classification score calculation normalize and return lists

    [ https://issues.apache.org/jira/browse/LUCENE-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084775#comment-14084775 ] 

Gergő Törcsvári commented on LUCENE-5699:
-----------------------------------------

So why are the normalized and normalizedList functions useful?

First of all, why normalized?
When I first tried to use Lucene Classification, one of the bigger problems was that the scores that come back mean nothing. Basically, the classifier returns the class and an arbitrary number. If you have two texts and push them through the classifier, the scores don't help you figure out which result is more trustworthy.
The normalized values give you that option. If you want to tell the user how sure you are, the normalized values help you out.
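
As a rough illustration of the normalization idea (a sketch only, not necessarily what the attached patches do): assuming a classifier produces one unnormalized log score per class, the scores can be rescaled so that they sum to one. The class name ScoreNormalizer and its method are hypothetical and are not part of Lucene.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Lucene API.
final class ScoreNormalizer {

    /**
     * Turns unnormalized log scores (one per class) into values that sum to 1.
     * Subtracting the maximum before exponentiating avoids underflow to zero
     * when the log scores are very negative.
     */
    static double[] normalize(double[] logScores) {
        if (logScores.length == 0) {
            return new double[0];
        }
        double max = Arrays.stream(logScores).max().getAsDouble();
        double[] out = new double[logScores.length];
        double sum = 0.0;
        for (int i = 0; i < logScores.length; i++) {
            out[i] = Math.exp(logScores[i] - max); // largest entry becomes exp(0) = 1
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum; // rescale so the values sum to 1
        }
        return out;
    }
}
```

For example, ScoreNormalizer.normalize(new double[] {-1000.0, -1001.0, -1002.0}) gives roughly 0.67, 0.24 and 0.09. Only the differences between the log scores matter, so the absolute magnitude (which grows with document length) no longer affects the result.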

Second, why lists?
If you can tell the user how sure you are, it's a short step to wanting to tell them what the other options are: which are the 3 or 5 most relevant classes.
Most classification algorithms already compute those numbers internally.
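
A minimal sketch of what returning the top few classes could look like, assuming the classifier already holds a score for every class. ScoredClass and TopClasses are hypothetical names used only for illustration; they are not Lucene types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical result type: a class label with its score.
final class ScoredClass implements Comparable<ScoredClass> {
    final String label;
    final double score;

    ScoredClass(String label, double score) {
        this.label = label;
        this.score = score;
    }

    @Override
    public int compareTo(ScoredClass other) {
        // Higher scores sort first.
        return Double.compare(other.score, this.score);
    }
}

// Hypothetical helper that picks the n best classes.
final class TopClasses {
    static List<ScoredClass> top(List<ScoredClass> all, int n) {
        List<ScoredClass> sorted = new ArrayList<>(all); // leave the input untouched
        Collections.sort(sorted);
        int end = Math.max(0, Math.min(n, sorted.size()));
        return new ArrayList<>(sorted.subList(0, end));
    }
}
```

Making the result type Comparable keeps the sorting logic in one place, so callers only have to say how many classes they want.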

The problem with the normalization and the lists:
Sadly, not all classification algorithms have lists; some just drop classes. So this can't go straight into the API, because some classification methods never have a list or a score.


I have 2 API suggestions:
The first is that the Classifier interface gets the normalized and normalizedList functions, and some of the implementations throw exceptions if somebody tries to use them.
Or the Classifier interface doesn't get them, but some classifiers can provide these functions.
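
The two suggestions could be sketched roughly as follows. All interface and class names here are simplified and hypothetical; they do not reproduce the real Lucene Classifier signatures, and a plain Map stands in for a proper result list.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Suggestion 1: the method lives on the common interface; some implementations throw.
interface ClassifierV1 {
    String assignClass(String text);
    Map<String, Double> normalizedList(String text); // may throw UnsupportedOperationException
}

final class HardClassifierV1 implements ClassifierV1 {
    public String assignClass(String text) { return "spam"; } // dummy decision
    public Map<String, Double> normalizedList(String text) {
        throw new UnsupportedOperationException("this classifier keeps no per-class scores");
    }
}

// Suggestion 2: the common interface stays small; an optional interface adds the list.
interface ClassifierV2 {
    String assignClass(String text);
}

interface NormalizedListProvider {
    Map<String, Double> normalizedList(String text);
}

final class HardClassifierV2 implements ClassifierV2 {
    public String assignClass(String text) { return "spam"; } // dummy decision
}

final class SoftClassifierV2 implements ClassifierV2, NormalizedListProvider {
    public String assignClass(String text) { return "spam"; } // dummy decision
    public Map<String, Double> normalizedList(String text) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("spam", 0.8); // dummy values that sum to 1
        scores.put("ham", 0.2);
        return scores;
    }
}

final class Caller {
    // With suggestion 2 the caller checks for the capability instead of catching an exception.
    static Map<String, Double> listOrEmpty(ClassifierV2 c, String text) {
        if (c instanceof NormalizedListProvider) {
            return ((NormalizedListProvider) c).normalizedList(text);
        }
        return Collections.emptyMap();
    }
}
```

With the first variant every caller can call normalizedList but must be prepared for a runtime exception; with the second, missing support is visible in the type and can be checked up front.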

> Lucene classification score calculation normalize and return lists
> ------------------------------------------------------------------
>
>                 Key: LUCENE-5699
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5699
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: modules/classification
>            Reporter: Gergő Törcsvári
>            Assignee: Tommaso Teofili
>         Attachments: 06-06-5699.patch, 0730.patch, 0803-base.patch
>
>
> Currently the classifiers can return only the "best matching" class. If somebody wants to use them for more complex tasks, they need to modify these classes to get the second and third results too. If it is possible to return a list, and it does not cost a lot of resources, why don't we do that? (We iterate over a list anyway.)
> The Bayes classifier returns values that are too small, and there was a bug with zero floats. It was fixed with logarithms. It would be nice to scale the class scores so that they sum to one; then we could compare the returned scores and relevance of two documents. (If we don't do this, the word count in the test documents affects the result score.)
> In bullet points:
> * In the Bayes classifier, normalized score values, returned as result lists.
> * In the KNN classifier, the possibility to return a result list.
> * Make ClassificationResult Comparable for list sorting.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org