Posted to dev@mahout.apache.org by "Andrew Palumbo (JIRA)" <ji...@apache.org> on 2015/03/18 16:13:38 UTC
[jira] [Commented] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents
[ https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367278#comment-14367278 ]
Andrew Palumbo commented on MAHOUT-1564:
----------------------------------------
I may end up splitting this up into two pieces. It's trivial in Scala Mahout, and is almost too abstract a problem to really write a useful job for in MRLegacy; it may be more useful as documentation.
> Naive Bayes Classifier for New Text Documents
> ---------------------------------------------
>
> Key: MAHOUT-1564
> URL: https://issues.apache.org/jira/browse/MAHOUT-1564
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.9
> Reporter: Andrew Palumbo
> Assignee: Andrew Palumbo
> Labels: DSL, legacy, scala, spark
> Fix For: 0.10.1, 0.10.0
>
>
> MapReduce and DSL Naive Bayes implementations currently lack the ability to classify a new document (outside of the training/holdout corpus). This new feature will do the following:
> 1. Vectorize a new text document using the dictionary and document frequencies from the training/holdout corpus
> - assume the original corpus was vectorized using `seq2sparse`; step (1) will use all of the same parameters.
> 2. Score and label a new document using a previously trained model.
> This effort will need to be done in parallel for MRLegacy and DSL implementations. Neither should be too much work.
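
The two steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Mahout code: the function names (`vectorize`, `classify`), the toy dictionary, document frequencies, and model weights are all hypothetical, the tokenizer is a bare whitespace split, and the weighting is a generic tf * log(N/df) that may differ from what `seq2sparse` actually computes. The model is assumed to be a standard multinomial Naive Bayes scored in log space.

```python
import math
from collections import Counter


def vectorize(text, dictionary, doc_freq, num_docs):
    """Map a new document to a sparse TF-IDF vector {term_index: weight}.

    Terms that are not in the training dictionary are dropped, because
    the trained model has no weights for them.
    """
    counts = Counter(tok for tok in text.lower().split() if tok in dictionary)
    vector = {}
    for term, tf in counts.items():
        idx = dictionary[term]
        # Generic IDF (assumption); reuses the training corpus statistics.
        idf = math.log(num_docs / doc_freq[idx])
        vector[idx] = tf * idf
    return vector


def classify(vector, log_priors, log_likelihoods):
    """Score a sparse vector against each label and return (best_label, scores)."""
    scores = {}
    for label, prior in log_priors.items():
        weights = log_likelihoods[label]
        scores[label] = prior + sum(w * weights[idx] for idx, w in vector.items())
    best = max(scores, key=scores.get)
    return best, scores


# Toy training artifacts (hypothetical values, for illustration only).
dictionary = {"goal": 0, "match": 1, "election": 2, "vote": 3}
doc_freq = {0: 2, 1: 2, 2: 2, 3: 2}  # number of training docs containing each term
num_docs = 4
log_priors = {"sports": math.log(0.5), "politics": math.log(0.5)}
log_likelihoods = {
    "sports": {0: math.log(0.4), 1: math.log(0.4), 2: math.log(0.1), 3: math.log(0.1)},
    "politics": {0: math.log(0.1), 1: math.log(0.1), 2: math.log(0.4), 3: math.log(0.4)},
}

new_doc = "Late goal decides the match"
vec = vectorize(new_doc, dictionary, doc_freq, num_docs)      # step 1
label, scores = classify(vec, log_priors, log_likelihoods)    # step 2
print(label)
```

The point of the sketch is that step 1 must reuse the dictionary and document frequencies saved from the original corpus rather than recompute them from the new document; otherwise the term indices and weights would not line up with the trained model.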
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)