Posted to dev@mahout.apache.org by "Andrew Palumbo (JIRA)" <ji...@apache.org> on 2014/12/22 02:06:13 UTC

[jira] [Comment Edited] (MAHOUT-1493) Port Naive Bayes to the Spark DSL

    [ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255330#comment-14255330 ] 

Andrew Palumbo edited comment on MAHOUT-1493 at 12/22/14 1:05 AM:
------------------------------------------------------------------

Thanks for looking at this, Pat. My thinking up to now, as far as using Sequence files goes, is that the input to Naive Bayes (for now) would be the output from seq2sparse: a <Text,VectorWritable> Sequence file. This is really just to keep it simple for now, and the CLI drivers that I wrote are a rough cut that I basically copied and pasted from your Item-similarity drivers (very cool, by the way).

I'm all for taking other input formats, especially since we're hopefully moving on from MR seq2sparse soon. I need to read up on what you did with the Readers and Writers.

The issue that a lot of people are having on the list with MRLegacy NB (if I'm thinking of the same issue you mentioned) is that each vectorized document has to contain a Category in its Key. The Mahout process for text vectorization has been to separate documents into directories by their respective categories, then run `mahout seqdirectory`, which converts each document into a sequence file entry with /directory/doc_id as the key, then seq2sparse, etc. Since the category extraction step in MRLegacy NB was hardcoded as a split on "/" that labels the document with the first token after the split (the directory id), I've kind of taken this as the convention.
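To make the convention concrete, here is a minimal sketch of that extraction step. It is not the MRLegacy code itself; the object and method names are made up for illustration, and it assumes keys of the form /directory/doc_id with forward slashes.

```scala
// Hypothetical helper, not the actual Mahout code.
object KeyCategory {
  // Assumes a seqdirectory-style key such as "/sports/doc_42".
  // Splitting on "/" yields Array("", "sports", "doc_42"), so the
  // category (the directory id) is the token at index 1.
  def categoryFromKey(key: String): String = key.split("/")(1)
}
```

A key that does not follow this layout (for example, one with no "/" at all) would fail here, which is roughly the problem people run into when their keys carry no category.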

Internally, the document ids are discarded: the TF-IDF weights are aggregated by category, and the individual document vectors are then discarded.
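As a rough sketch of that aggregation, using plain Scala collections rather than the Spark DSL: the names below are hypothetical, the vectors are simple `Array[Double]` stand-ins for the TF-IDF vectors, and the keys are assumed to look like /directory/doc_id.

```scala
// Hypothetical illustration only; the real code works on distributed matrices.
object AggregateByCategory {
  // Element-wise sum of two weight vectors of equal length.
  private def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // docs: (key, TF-IDF weights) pairs. The doc id part of each key is
  // dropped; only the per-category sums are kept.
  def aggregate(docs: Seq[(String, Array[Double])]): Map[String, Array[Double]] =
    docs
      .groupBy { case (key, _) => key.split("/")(1) }
      .map { case (category, rows) => category -> rows.map(_._2).reduce(add) }
}
```

The result has one weight vector per category, which is all the training step needs from the individual documents.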

I've tried to relax the Category extraction convention in the DSL Naive Bayes by allowing the developer to pass an arbitrary `String => String` function in the aggregation constructor. This way the Labels can be extracted by, e.g., a regex pattern. I'd like to incorporate this as an option into the CLI driver but haven't really made it that far yet.
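For example, a regex-based extractor might look something like the sketch below. The key format ("doc_17|label=spam"), the "unknown" fallback, and the `labelAll` stand-in are all assumptions made for illustration; the only thing taken from the patch is that the extractor is a `String => String`.

```scala
// Hypothetical example of a pluggable label extractor.
object LabelExtractors {
  // Assumed key format: "doc_17|label=spam".
  private val LabelPattern = """label=(\w+)""".r

  // Returns the captured label, or "unknown" when the key has none.
  val byRegex: String => String = key =>
    LabelPattern.findFirstMatchIn(key).map(_.group(1)).getOrElse("unknown")

  // Stand-in for the aggregation step: it only needs some String => String.
  def labelAll(keys: Seq[String], extract: String => String): Seq[String] =
    keys.map(extract)
}
```

Because the aggregation only depends on the function type, the directory-based split and a regex like this one are interchangeable from its point of view.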

So I think that with any format there may be some confusion over the document labeling, but I agree that we should support other file formats as input.





> Port Naive Bayes to the Spark DSL
> ---------------------------------
>
>                 Key: MAHOUT-1493
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1493
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>            Reporter: Sebastian Schelter
>            Assignee: Andrew Palumbo
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493a.patch
>
>
> Port our Naive Bayes implementation to the new spark dsl. Shouldn't require more than a few lines of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)