Posted to dev@mahout.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2014/12/08 00:44:12 UTC

[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL

    [ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237320#comment-14237320 ] 

ASF GitHub Bot commented on MAHOUT-1493:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

    https://github.com/apache/mahout/pull/32#issuecomment-65961642
  
    I’ve had some time to work on this recently and now have Naïve Bayes implemented in the math-scala and spark modules, with basic CLI drivers (for Spark), and working with the classify-20newsgroups.sh example.  It still needs some refactoring and cleanup.  In the math-scala implementation, there are some problems with the `NaiveBayes.extractLabelsAndAggregateObservations(…)` function in NaiveBayes.scala.  I’ve put together a hack using `mapBlock(…)` which works, but very inefficiently (it uses 2 transposes on a large set).  Not very pretty.  Also, converting from a String-keyed to an Int-keyed DRM currently requires collecting the entire dataset up front.  I’m going to investigate a `.toIntKeyed()` function for `DrmLike`, though I’m still not all that familiar with the RDD backing.  So currently I’m just overriding it in a `SparkNaiveBayes` object in the spark package.
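
    To illustrate the collect-up-front conversion described above, here is a minimal sketch using plain Scala collections.  It does not use the Mahout DRM API: `KeyConversionSketch`, `Row` and `toIntKeyed` are hypothetical names chosen for this example, and the in-memory `Seq` stands in for an already collected dataset.

```scala
// Hypothetical stand-in for a String-keyed matrix: a sequence of (row label, row values).
// Plain Scala collections only; this is NOT the Mahout DRM API.
object KeyConversionSketch {
  type Row = Array[Double]

  /** Re-keys rows by their position and returns an index -> label map
    * so the original String keys can be recovered later. */
  def toIntKeyed(rows: Seq[(String, Row)]): (Seq[(Int, Row)], Map[Int, String]) = {
    val indexed = rows.zipWithIndex
    // New Int key is simply the row's position in the collected sequence.
    val rekeyed = indexed.map { case ((_, values), i) => (i, values) }
    // Keep the labels on the side, keyed by the same Int index.
    val labels = indexed.map { case ((label, _), i) => i -> label }.toMap
    (rekeyed, labels)
  }
}
```

    Numbering the rows this way needs the whole sequence in one place, which mirrors the non-distributed behaviour.  A distributed version would need something like per-partition offsets; Spark's `RDD.zipWithIndex` might be one building block, but that is an untested assumption here.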
    
    There are three (trivial) MRLegacy dependencies:  `ComplementaryThetaTrainer`, `ResultAnalyzer` and `ClassifierResult`.  I’ll probably port `ComplementaryThetaTrainer` into the Math-Scala module.  Maybe `ComplementaryThetaTrainer` and `ResultAnalyzer` are candidates to be moved into Mahout-Math.  I’m not sure if we’re trying to keep MRLegacy out of Math-Scala?
    A few issues that I still need to work out:
    
    1.	A bug in the H2ODrm: `H2OHelper.java` uses a `water.fvec.Vec` vector to store label keys, but `water.fvec.Vec` does not accept String values (it throws an error).  So we need a new distributed data structure to store String row labels for H2O DRMs.  Otherwise this should be working for H2O.
    
    2.	Need a distributed method for converting String-keyed DRMs to Int-keyed DRMs (as mentioned above).
    
    3.	As is, the `NBModel` is not fully serializable (it has several `RandomAccessSparseVector` fields and a `Matrix` field in the class).  I’m still working out a way to broadcast it to the `mapBlock` closure.  Should be a relatively straightforward fix.  So `NaiveBayes.test(…)` is running sequentially for now, and collecting the full test set up front.  I haven’t looked at it too closely, but I’m wondering if we can add a `drmBroadcast(…)` method for arbitrary (serializable) broadcast objects?
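
    As a rough illustration of the serializability problem in item 3, the sketch below checks whether an object survives Java serialization and shows one possible workaround: copying model state into plain arrays before shipping it to a closure.  `ModelSnapshot` and `SerializationCheck` are hypothetical names, and the fields are simplified stand-ins rather than the real `NBModel` layout.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical, simplified stand-in for a trained model; NOT Mahout's NBModel.
// Every field is a primitive or an array of primitives, and Scala case classes
// are Serializable, so Java serialization can write it.
final case class ModelSnapshot(
    weightsPerLabel: Array[Double],
    weightsPerLabelAndFeature: Array[Array[Double]],
    alpha: Double)

object SerializationCheck {
  /** Returns true if `value` can be written with Java serialization. */
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

    A check like `SerializationCheck.canSerialize(model)` could be run in a unit test before attempting a broadcast, so a non-serializable field fails fast instead of inside a distributed closure.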
    
    Next steps:
    
    1.	Address above issues.
    
    2.	Add in more CLI options.
     
    3.	Add more tests.
    
    4.	Implement the `NaiveBayes.classifyNew(…)` method as outlined in MAHOUT-1564.
    
    
    If there are no objections, I’ll probably commit this after a bit more cleanup and some style fixes, and then work on the next steps from there.
    
    Any input is appreciated.



> Port Naive Bayes to the Spark DSL
> ---------------------------------
>
>                 Key: MAHOUT-1493
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1493
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>            Reporter: Sebastian Schelter
>            Assignee: Andrew Palumbo
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493a.patch
>
>
> Port our Naive Bayes implementation to the new Spark DSL. Shouldn't require more than a few lines of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)