Posted to dev@lucene.apache.org by "Tommaso Teofili (JIRA)" <ji...@apache.org> on 2014/08/08 09:50:12 UTC

[jira] [Comment Edited] (LUCENE-5699) Lucene classification score calculation normalize and return lists

    [ https://issues.apache.org/jira/browse/LUCENE-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090443#comment-14090443 ] 

Tommaso Teofili edited comment on LUCENE-5699 at 8/8/14 7:49 AM:
-----------------------------------------------------------------

thanks Gergő, the patch looks much better.

bq. When I first tried to use Lucene classification, one of the bigger problems was that the scores that come back mean nothing. Basically the classifier returns the class and a random number. If you have two texts and you push them through the classifier, the scores don't help you figure out which result is more trustworthy.

While the classifier of course doesn't return a random number as the score, I agree the score should be normalized between 0 and 1, the higher the better (basically this amounts to a probability measure).
Regarding the implementation, I don't think the API needs to be touched for this: normalized scores should always be returned in the _ClassificationResult_ objects produced by _Classifier#assignClass_ method implementations.
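
As a rough illustration of the kind of normalization meant here (a sketch only, not code from the attached patches): assuming a classifier produces one log score per class, the scores can be shifted by their maximum, exponentiated, and divided by their sum so that they fall between 0 and 1 and add up to 1. The class name _ScoreNormalizer_ and the use of a _Map_ keyed by class label are hypothetical choices for the example, not part of the Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Lucene classification module. */
final class ScoreNormalizer {

  private ScoreNormalizer() {}

  /**
   * Converts per-class log scores into values in [0, 1] that sum to 1.
   * Subtracting the maximum before exponentiating avoids underflow to zero
   * when all log scores are very negative. Assumes at least one finite score.
   */
  static Map<String, Double> normalize(Map<String, Double> logScores) {
    Map<String, Double> normalized = new LinkedHashMap<>();
    if (logScores.isEmpty()) {
      return normalized;
    }

    // Find the largest log score; it maps to exp(0) = 1 after shifting.
    double max = Double.NEGATIVE_INFINITY;
    for (double s : logScores.values()) {
      max = Math.max(max, s);
    }

    // Exponentiate the shifted scores and accumulate their sum.
    double sum = 0d;
    for (Map.Entry<String, Double> e : logScores.entrySet()) {
      double shifted = Math.exp(e.getValue() - max);
      normalized.put(e.getKey(), shifted);
      sum += shifted;
    }

    // Divide by the sum so the values add up to 1.
    for (Map.Entry<String, Double> e : normalized.entrySet()) {
      e.setValue(e.getValue() / sum);
    }
    return normalized;
  }
}
```

With scores normalized this way, the top score of one document can be compared with the top score of another, regardless of how many words each document contains.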

bq. If you can tell the user how sure you are, it's not a big step to also tell them what the other options are: what the 3 or 5 most relevant classes are.

OK, the use case sounds reasonable. However, my only concern (which extends to the normalization implementation, as it's based on the generation of lists) is that the current implementation may not scale well if you have a huge number of classes.

Regarding the API, I would be in favor of introducing something like _Classifier#getClasses(String text)_ returning a _List<ClassificationResult>_ for this use case and, alternatively or in addition, _Classifier#getClasses(String text, int max)_ to limit the number of classes returned (as the user is probably interested in the first N classes rather than the whole list of classes).
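
To make the _max_ variant more concrete, here is a minimal sketch of how the top N classes could be selected without sorting the whole list, which may help with the scaling concern above. It is only an illustration under assumptions: _ScoredClass_ is a hypothetical stand-in for _ClassificationResult_ (made _Comparable_ by score), and _TopClasses#top_ is an invented helper, not an existing or proposed Lucene method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for ClassificationResult; not the Lucene class. */
final class ScoredClass implements Comparable<ScoredClass> {
  final String assignedClass;
  final double score;

  ScoredClass(String assignedClass, double score) {
    this.assignedClass = assignedClass;
    this.score = score;
  }

  @Override
  public int compareTo(ScoredClass other) {
    // Natural order is ascending by score.
    return Double.compare(score, other.score);
  }
}

/** Hypothetical helper showing one way to keep only the best N results. */
final class TopClasses {

  private TopClasses() {}

  /** Returns at most max results, highest score first. */
  static List<ScoredClass> top(Iterable<ScoredClass> all, int max) {
    List<ScoredClass> result = new ArrayList<>();
    if (max <= 0) {
      return result;
    }

    // Min-heap on score: the head is always the weakest result kept so far.
    PriorityQueue<ScoredClass> heap = new PriorityQueue<>();
    for (ScoredClass candidate : all) {
      if (heap.size() < max) {
        heap.add(candidate);
      } else if (candidate.compareTo(heap.peek()) > 0) {
        // Replace the weakest kept result with the better candidate.
        heap.poll();
        heap.add(candidate);
      }
    }

    // Only the kept results (at most max) are sorted, best first.
    result.addAll(heap);
    Collections.sort(result, Collections.reverseOrder());
    return result;
  }
}
```

The heap never holds more than _max_ entries, so memory and sorting cost depend on N rather than on the total number of classes; every class still has to be scored once.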



> Lucene classification score calculation normalize and return lists
> ------------------------------------------------------------------
>
>                 Key: LUCENE-5699
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5699
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: modules/classification
>            Reporter: Gergő Törcsvári
>            Assignee: Tommaso Teofili
>         Attachments: 06-06-5699.patch, 0730.patch, 0803-base.patch
>
>
> Currently the classifiers can return only the "best matching" class. Anyone who wants to use them for more complex tasks needs to modify these classes to get the second and third results too. If it is possible to return a list, and it does not cost many resources, why don't we do that? (We iterate over a list anyway.)
> The Bayes classifier returned values that were too small, and there was a bug with zero floats. It was fixed by using logarithms. It would be nice to scale the class scores so that they sum to one; then we could compare the returned scores and relevance of two documents. (If we don't do this, the word count of the test documents affects the result score.)
> In bullet points:
> * In the Bayes classifier, normalized score values and returned result lists.
> * In the KNN classifier, the possibility to return a result list.
> * Make ClassificationResult Comparable for list sorting.


