Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2010/10/04 10:05:33 UTC

[jira] Commented: (MAHOUT-522) Using different data sources for input/output

    [ https://issues.apache.org/jira/browse/MAHOUT-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917512#action_12917512 ] 

Ted Dunning commented on MAHOUT-522:
------------------------------------

If you want a classifier framework that is more API oriented, take a look at the o.a.m.classifier.sgd package.  It provides in-memory, sequential, multi-threaded implementations of logistic regression.
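As a rough illustration of what "in-memory sequential" training means here, a toy SGD trainer for binary logistic regression fits in a few lines of plain Java. This is a deliberately simplified sketch with invented names, not Mahout's actual API; the real OnlineLogisticRegression in o.a.m.classifier.sgd handles priors, learning-rate schedules, and sparse vectors.

```java
// Illustrative toy: plain sequential SGD for binary logistic regression.
// NOT Mahout's API -- see o.a.m.classifier.sgd for the real thing.
public class TinySgd {
    private final double[] w;   // weights; the last slot is the bias term
    private final double rate;  // fixed learning rate

    public TinySgd(int numFeatures, double rate) {
        this.w = new double[numFeatures + 1];
        this.rate = rate;
    }

    /** Probability that the instance belongs to class 1. */
    public double classify(double[] x) {
        double dot = w[w.length - 1];            // start with the bias
        for (int i = 0; i < x.length; i++) {
            dot += w[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));     // logistic link
    }

    /** One sequential SGD step on a single (instance, label) pair. */
    public void train(double[] x, int label) {
        double err = label - classify(x);        // gradient of the log-loss
        for (int i = 0; i < x.length; i++) {
            w[i] += rate * err * x[i];
        }
        w[w.length - 1] += rate * err;           // bias update
    }

    public static void main(String[] args) {
        // Learn logical OR, which is linearly separable.
        TinySgd lr = new TinySgd(2, 0.5);
        double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] ys = {0, 1, 1, 1};
        for (int pass = 0; pass < 200; pass++) {
            for (int i = 0; i < xs.length; i++) {
                lr.train(xs[i], ys[i]);
            }
        }
        for (double[] x : xs) {
            System.out.println(lr.classify(x) > 0.5 ? 1 : 0);  // prints 0,1,1,1
        }
    }
}
```

The whole loop runs against data already in memory, which is exactly why this style of classifier is not tied to the file system the way the Hadoop jobs are.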

Your point about the entire system being a bit limited is spot on.  The Naive Bayes models do have a generic interface for saving models and reloading them, but the Hadoop-based parts of Mahout are heavily dependent on file systems, a characteristic that is easy to inherit from the assumptions that come with running code on Hadoop.
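A file-system-free persistence contract of the kind hinted at here could be as simple as expressing model I/O against streams instead of paths. The sketch below is hypothetical (the interface and class names are invented, and Mahout's actual Naive Bayes I/O differs), but it mirrors the write(DataOutput)/readFields(DataInput) convention that Hadoop's own Writable interface uses, so the same model could be round-tripped through a DB blob, a byte array, or HDFS alike:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical contract: model I/O against streams, not file paths.
interface StreamPersistable {
    void write(DataOutput out) throws IOException;
    void read(DataInput in) throws IOException;
}

// Stand-in for a real model; only the persistence plumbing matters here.
class ToyModel implements StreamPersistable {
    double[] weights = new double[0];

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(weights.length);
        for (double w : weights) {
            out.writeDouble(w);
        }
    }

    @Override
    public void read(DataInput in) throws IOException {
        weights = new double[in.readInt()];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = in.readDouble();
        }
    }

    public static void main(String[] args) throws IOException {
        ToyModel m = new ToyModel();
        m.weights = new double[]{0.5, -1.25, 3.0};

        // Round-trip entirely through memory -- no file system involved.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        m.write(new DataOutputStream(buf));

        ToyModel loaded = new ToyModel();
        loaded.read(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(Arrays.equals(m.weights, loaded.weights));  // prints true
    }
}
```

Anything that can hand the model a DataInput/DataOutput can then act as the store, which is the decoupling the issue is asking for.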

That said, I am sure that if you have concrete suggestions or, better yet, patches, they would be very welcome.  Don't feel you have to build a final answer; just try something.  If there is a spark there, we will help start a fire.


> Using different data sources for input/output
> ---------------------------------------------
>
>                 Key: MAHOUT-522
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-522
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>            Reporter: Oleksandr Petrov
>
> Hi,
> Mahout is currently bound to the file system, at least that is my impression. Most of the data structures I'm working with aren't located on the file system, and likewise my output isn't bound to the FS; most of the time I'm forced to export my datasets from the DB to the FS, and then load the results back into the DB afterwards.
> Most likely it's not very interesting for the core developers, who are focused on implementing the algorithms, to start writing adapters for DBs or anything like that.
> For instance, SequenceFilesFromDirectory is a simple way to take your files from a directory and convert them all to SequenceFiles. Some people would be extremely grateful if there were an interface they could implement to send their files from a DB straight to a SequenceFile, without the file system as an intermediary. If anyone's interested, I can provide a patch.
> The second issue is related to the workflow itself. For instance, what if I already have a Dictionary, TF-IDF, and TF in some particular format created by other parts of my infrastructure? Again, I need to convert those to Mahout data structures. Can't we allow the jobs to accept more generic types (or interfaces, for instance) when working with TF-IDF, TF, and Dictionaries, without binding them to the Hadoop FS?
> I do realize that Mahout is part of the Lucene/Hadoop infrastructure, but it's also an independent project, so it could benefit and gain wider adoption if it allowed working with any format. I have an idea of how to implement this, and have partially implemented it for our own infrastructure needs, but I'd really like to hear some feedback from users and Hadoop developers on whether it's suitable and whether anyone else might benefit from it.
> Thank you!
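The adapter idea sketched in the quoted issue might look something like this in plain Java: instead of SequenceFilesFromDirectory walking a directory, the converter would consume any Iterable of (key, text) pairs, so a JDBC result set or an in-memory collection could feed it directly. All names below are invented for illustration (this is not existing Mahout code), and the toy converter only counts records where the real one would append each pair to a Hadoop SequenceFile.Writer:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical source abstraction: any stream of (key, text) documents,
// regardless of whether they live on a file system, in a DB, or in memory.
interface DocumentSource extends Iterable<Map.Entry<String, String>> {}

// An in-memory source standing in for, e.g., a JDBC-backed implementation.
class InMemorySource implements DocumentSource {
    private final Map<String, String> docs;

    InMemorySource(Map<String, String> docs) {
        this.docs = docs;
    }

    @Override
    public Iterator<Map.Entry<String, String>> iterator() {
        return docs.entrySet().iterator();
    }
}

class Converter {
    // Where Mahout would append each pair to a SequenceFile.Writer,
    // this toy just counts the records it sees.
    static int convert(DocumentSource source) {
        int n = 0;
        for (Map.Entry<String, String> doc : source) {
            // writer.append(new Text(doc.getKey()), new Text(doc.getValue()));
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "hello mahout");
        docs.put("doc2", "sequence files without a file system");
        System.out.println(Converter.convert(new InMemorySource(docs)));  // prints 2
    }
}
```

The directory-walking behavior then becomes just one DocumentSource implementation among several, which is the kind of patch the issue is offering.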

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.