Posted to dev@mahout.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2014/12/08 00:44:12 UTC
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237320#comment-14237320 ]
ASF GitHub Bot commented on MAHOUT-1493:
----------------------------------------
Github user andrewpalumbo commented on the pull request:
https://github.com/apache/mahout/pull/32#issuecomment-65961642
I’ve had some time to work on this recently and have Naïve Bayes implemented in the math-scala and spark modules, with basic CLI drivers (for Spark), and working with the classify-20newsgroups.sh example. It still needs some refactoring and cleanup. In the math-scala implementation, there are some problems with the `NaiveBayes.extractLabelsAndAggregateObservations(…)` function in NaiveBayes.scala. I’ve put together a hack using `mapBlock(…)` which works, but very inefficiently (it uses 2 transposes on a large set). Not very pretty. Also, in order to convert from a String-Keyed to an Int-Keyed DRM, it is necessary to collect the entire dataset up front. I’m going to investigate a `.toIntKeyed()` function for `DrmLike`, though I’m still not all that familiar with the RDD backing. So currently I’m just overriding it in a `SparkNaiveBayes` object in the spark package.
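For anyone following along, here is a rough, in-memory sketch of what the label extraction and aggregation step is meant to compute. It uses plain Scala collections, not the DRM API, so it says nothing about how to do this efficiently in a distributed way. The object name, the function names, and the `/<label>/<docId>` row-key layout are assumptions made for illustration only; it also assumes all rows have the same number of features.

```scala
// Illustrative only: in-memory stand-in for the aggregation, not Mahout code.
object LabelAggregationSketch {

  // Assumed key layout: "/<label>/<docId>". Takes the first non-empty path segment.
  def extractLabel(rowKey: String): String =
    rowKey.split("/").filter(_.nonEmpty).head

  // Sums all observation vectors that share a label.
  // Returns the String -> Int label index and one aggregated row per label,
  // with rows ordered by that index.
  def aggregate(rows: Seq[(String, Array[Double])]): (Map[String, Int], Array[Array[Double]]) = {
    // Building this index needs every key up front, which is the
    // "collect the entire dataset" cost mentioned above.
    val labelIndex = rows.map(r => extractLabel(r._1)).distinct.sorted.zipWithIndex.toMap
    val numFeatures = rows.headOption.map(_._2.length).getOrElse(0)
    val sums = Array.fill(labelIndex.size, numFeatures)(0.0)
    for ((key, vec) <- rows) {
      val row = sums(labelIndex(extractLabel(key)))
      var j = 0
      while (j < numFeatures) { row(j) += vec(j); j += 1 }
    }
    (labelIndex, sums)
  }
}
```

The returned index is the String-to-Int key mapping; a distributed version would need to build that mapping without pulling all rows to the driver.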
There are three (trivial) MRLegacy dependencies: `ComplementaryThetaTrainer`, `ResultAnalyzer`, and `ClassifierResult`. I’ll probably port `ComplementaryThetaTrainer` into the Math-Scala module. Maybe `ComplementaryThetaTrainer` and `ResultAnalyzer` are candidates to be moved into Mahout-Math. I’m not sure if we’re trying to keep MRLegacy out of Math-Scala?
A few issues that I still need to work out:
1. A bug in the H2ODrm: `H2OHelper.java` uses a `water.fvec.Vec` vector to store label keys, but `water.fvec.Vec` does not accept String values (it throws an error). So we need a new distributed data structure to store String row labels for H2O DRMs. Otherwise this should be working for H2O.
2. Need a distributed method for the conversion from String-Keyed to Int-Keyed DRMs (as mentioned above).
3. As is, the `NBModel` is not fully serializable (several `RandomAccessSparseVector` fields and a `Matrix` field in the class). I’m still working out a way to broadcast it to the `mapBlock` closure. Should be a relatively straightforward fix. So `NaiveBayes.test(…)` is running sequentially for now, and collecting the full test set up front. I haven’t looked at it too closely, but I’m wondering if we can add a `drmBroadcast(…)` method for arbitrary (serializable) Broadcast Objects?
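On the serialization point (issue 3), one possible direction is to copy the model's numeric state into a holder made only of primitive arrays, which default Java serialization handles, and ship that holder to the closure. The sketch below is an assumption-laden illustration, not the actual `NBModel`: `ModelSnapshot`, its field names, and `roundTrip` are hypothetical, and it only checks that such a holder survives a serialize/deserialize cycle.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Illustrative only: a hypothetical serializable stand-in for the model state.
object SerializableModelSketch {

  // Dense arrays instead of vector/matrix objects, so every field is serializable.
  final case class ModelSnapshot(
      weightsPerLabelAndFeature: Array[Array[Double]],
      weightsPerLabel: Array[Double],
      alphaI: Double) extends java.io.Serializable

  // Serializes a value to bytes and reads it back, roughly what happens
  // when an object captured by a closure is sent to a worker.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Densifying sparse vectors this way costs memory, so whether it is acceptable depends on the number of labels and features; it is only meant to make the shape of the fix concrete.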
Next steps:
1. Address above issues.
2. Add in more CLI options.
3. Add more tests.
4. Implement NaiveBayes.classifyNew(…) method as outlined in MAHOUT-1564.
If there are no objections, I’ll probably commit this after a bit more cleanup and some style fixes, and then work on the next steps from there.
Any input is appreciated.
> Port Naive Bayes to the Spark DSL
> ---------------------------------
>
> Key: MAHOUT-1493
> URL: https://issues.apache.org/jira/browse/MAHOUT-1493
> Project: Mahout
> Issue Type: Bug
> Components: Classification
> Reporter: Sebastian Schelter
> Assignee: Andrew Palumbo
> Fix For: 1.0
>
> Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493a.patch
>
>
> Port our Naive Bayes implementation to the new Spark DSL. Shouldn't require more than a few lines of code.