Posted to dev@mahout.apache.org by "Robin Anil (JIRA)" <ji...@apache.org> on 2009/12/13 16:01:18 UTC

[jira] Created: (MAHOUT-220) Mahout Bayes Code cleanup

Mahout Bayes Code cleanup
-------------------------

                 Key: MAHOUT-220
                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
    Affects Versions: 0.3
            Reporter: Robin Anil
            Assignee: Robin Anil
             Fix For: 0.2


Following Isabel's checkstyle configuration, I am adding a whole slew of code cleanups, with the following exceptions:
1.  Line length used is 120 instead of 80.
2.  The static final logger is kept as log, not LOG.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794976#action_12794976 ] 

Ted Dunning commented on MAHOUT-220:
------------------------------------


Robin,

I was just looking at some of the code and was having a hard time understanding how the implementations of bayes.interfaces.DataSource store their information.  I also had trouble understanding just what was being stored.

I think that a tiny amount of package or class level comments would clear that up enormously.

My goal in reading the code was to understand how much my recent start on an sgd classifier could share with the already existing Naive Bayes classifiers.  I mention that because it always helps me write comments if I know what question in the reader's mind I need to answer.

> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795124#action_12795124 ] 

Jake Mannix commented on MAHOUT-220:
------------------------------------

Ted,

  While I'm totally down with using the randomizer / hashing techniques in places, I don't think we should totally wed ourselves to them either. The option of using the "real" vector representation should probably be implemented too, as people understand it better, and it's pretty standard.

bq. If you like these, we can promote them to a common area under classifier.

  They might belong in a more general place, actually.  If I'm going to use some of this stuff in the decompositions (although I'm not sure yet of the efficacy of the single hash for doing SVD), it should go somewhere in the math module.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795122#action_12795122 ] 

Ted Dunning commented on MAHOUT-220:
------------------------------------


Anil,

See classifier.sgd.TermRandomizer (and implementations DenseRandomizer and BinaryRandomizer) for a term list to vector converter.  These are in the MAHOUT-228 patch.

It has the virtue of converting term lists to vectors of fixed size.  It currently does not do term weighting, but that would be a very easy fix.  The approach is roughly along the lines of http://arxiv.org/PS_cache/arxiv/pdf/0902/0902.2206v2.pdf or the stochastic decomposition work.
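As a rough illustration of the idea of turning a term list into a fixed-size vector by hashing, here is a minimal Java sketch. It is not the TermRandomizer code from the MAHOUT-228 patch: the class name HashedTermVectorizer, the use of String.hashCode() and the single probe per term are all assumptions made for the example.

```java
import java.util.List;

/** Hypothetical sketch: maps a list of terms onto a dense vector of fixed size. */
public class HashedTermVectorizer {
    private final int size;

    public HashedTermVectorizer(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    /** Slot for a term, always in [0, size); floorMod guards against negative hash codes. */
    public int indexOf(String term) {
        return Math.floorMod(term.hashCode(), size);
    }

    /** Counts each term into its slot; no term weighting is applied. */
    public double[] vectorize(List<String> terms) {
        double[] vector = new double[size];
        for (String term : terms) {
            vector[indexOf(term)] += 1.0;
        }
        return vector;
    }
}
```

The vector length depends only on the chosen size, never on the vocabulary, which is what makes the representation fixed-size. Two terms that share a slot simply add up.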

If you like these, we can promote them to a common area under classifier.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789907#action_12789907 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

Not yet. mvn checkstyle:checkstyle is still throwing some errors. Is the checkstyle configuration there out of date?


> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.2
>
>         Attachments: MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790653#action_12790653 ] 

Isabel Drost commented on MAHOUT-220:
-------------------------------------

Before reorganizing code - could someone who is more familiar with the specific rules of the code-style used at Lucene double-check the exact checkstyle rules used for site-generation? I reused the checkstyle configuration that was already in Mahout-trunk (relaxing some of its rules) but am in doubt whether it really reflects our rules.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795921#action_12795921 ] 

Sean Owen commented on MAHOUT-220:
----------------------------------

I think a utils module remains a good idea, or else core starts to depend on a whole bunch of stuff merely because of some tool code sitting around. It seems right to me.

Math should also not depend on core.

... but back to the issue at hand, literally: is this ready to commit? It seems like this was not the original topic of the issue.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795140#action_12795140 ] 

Jake Mannix commented on MAHOUT-220:
------------------------------------

bq. Robin:  This is a library, our job is to have options for people like us to debate over . So lets agree upon a common mechanism.

Yep, agreed.  We need fully deterministic techniques as well as probabilistic ones (which will often scale better), and let people use what works for them and they are comfortable with.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795133#action_12795133 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

Anyway, I guess we are sounding like ML engineers here. This is a library; our job is to have options for people like us to debate over :). So let's agree upon a common mechanism.

That is, have different ways to create a term frequency vector, i.e. List<String> => SparseVector from documents.

Once the SparseVector is created, use uniform M/R jobs to do things like tf-idf weighting and log-likelihood (although I think we need the original file, not the SparseVector, to get the co-occurrence).

Any ideas?
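
To make the List<String> => SparseVector step concrete, here is a small single-machine Java sketch. It is only an illustration: the class name TermFrequencyVectorizer is made up, a sorted Map<Integer, Double> stands in for Mahout's SparseVector, and the dictionary is grown on the fly rather than built by a separate job.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: exact feature numbering plus term-frequency counting. */
public class TermFrequencyVectorizer {
    // term -> integer feature id, assigned in order of first appearance
    private final Map<String, Integer> dictionary = new HashMap<>();

    /** Returns the id of a term, assigning the next free id if the term is new. */
    public int idOf(String term) {
        Integer id = dictionary.get(term);
        if (id == null) {
            id = dictionary.size();
            dictionary.put(term, id);
        }
        return id;
    }

    /** Builds a sparse TF vector (feature id -> count) for one tokenized document. */
    public Map<Integer, Double> vectorize(List<String> tokens) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : tokens) {
            vector.merge(idOf(token), 1.0, Double::sum);
        }
        return vector;
    }
}
```

Weighting such as tf-idf can then be applied to these vectors in a separate, uniform pass, independent of how the vectors were created.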








[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795136#action_12795136 ] 

Ted Dunning commented on MAHOUT-220:
------------------------------------

{quote}
For sgd algorithm. I suggest you define your own matrix names, row indices and column indices, which your algorithm and your datastore agree upon.
{quote}

That is fine if sgd is an island, but it plausibly should be able to output models to be used by the Bayes classifier in a map-reduce setting.  That requires some documentation of how DataStore is used by the Bayes models.



[jira] Updated: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-220:
-----------------------------

    Fix Version/s:     (was: 0.2)
                   0.3

Robin, is this something I should commit, or do you have an updated version?

> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 



[jira] Issue Comment Edited: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795050#action_12795050 ] 

Robin Anil edited comment on MAHOUT-220 at 12/29/09 1:36 PM:
-------------------------------------------------------------

Datastore is an interface which allows you to pick a named vector or a named matrix and look up a cell.
For the Bayes classifier, the entire code is based on tokens and not SparseVectors. The names of the matrix, the row and the column are therefore strings, and the contract between the algorithm and the Datastore is decided per algorithm. For the CBayes/Bayes algorithms, we have HBaseBayesDatastore.java and InMemoryBayesDatastore.java.

{code}
  double getWeight(String matrixName, String row, String column) throws InvalidDatastoreException;
  double getWeight(String vectorName, String index) throws InvalidDatastoreException;
{code}
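
For illustration only, a string-keyed store honouring the two lookups above could look like the following Java sketch. It is not the InMemoryBayesDatastore code: the class name, the setWeight methods and the choice to return 0.0 for missing cells are assumptions, and the checked InvalidDatastoreException is left out so that the example stands alone.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a string-keyed store of named matrices and named vectors. */
public class InMemoryDatastoreSketch {
    // matrixName -> row -> column -> weight
    private final Map<String, Map<String, Map<String, Double>>> matrices = new HashMap<>();
    // vectorName -> index -> weight
    private final Map<String, Map<String, Double>> vectors = new HashMap<>();

    public void setWeight(String matrixName, String row, String column, double weight) {
        matrices.computeIfAbsent(matrixName, k -> new HashMap<>())
                .computeIfAbsent(row, k -> new HashMap<>())
                .put(column, weight);
    }

    public void setWeight(String vectorName, String index, double weight) {
        vectors.computeIfAbsent(vectorName, k -> new HashMap<>()).put(index, weight);
    }

    /** Looks up one matrix cell; missing cells read as 0.0 (an assumption of this sketch). */
    public double getWeight(String matrixName, String row, String column) {
        Map<String, Map<String, Double>> matrix = matrices.get(matrixName);
        if (matrix == null) {
            return 0.0;
        }
        Map<String, Double> cells = matrix.get(row);
        return cells == null ? 0.0 : cells.getOrDefault(column, 0.0);
    }

    /** Looks up one vector cell; missing cells read as 0.0 (an assumption of this sketch). */
    public double getWeight(String vectorName, String index) {
        Map<String, Double> vector = vectors.get(vectorName);
        return vector == null ? 0.0 : vector.getOrDefault(index, 0.0);
    }
}
```

Which matrix and vector names exist, and what the rows and columns mean, is exactly the per-algorithm contract: an algorithm might, say, agree with its store on a matrix named "weight" with features as rows and labels as columns (names chosen here purely as an example).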

For the sgd algorithm, I suggest you define your own matrix names, row indices and column indices, which your algorithm and your datastore agree upon.

I know this creates a limitation: you can't use integer-based column and row names. Maybe we can parameterize it, OR change the Bayes package to use Vectors instead of the current string-token-based implementation.

I am currently writing a Map/Reduce job to convert text documents to vectors without relying on Lucene. Once that is done, I will overhaul the classifier package to use SparseVectors.

Before that, I need to know whether this patch is OK in terms of code style. I will then patch it and start with the enhancements.




[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795554#action_12795554 ] 

Jake Mannix commented on MAHOUT-220:
------------------------------------

bq. This does raise the question of whether our modules are serving our needs or we are serving theirs.

Heh, good question.  I really like math *not* depending on core, for lots of reasons (for example, other projects which want our math can import that without needing core, and once I figure out how to do MAHOUT-205 correctly, without needing hadoop at all, which is a big plus).

Why does utils depend on core?  Why is utils in its own module anyways, instead of just being in core?  Is there anything which we imagine will depend on core but not need utils?  Does RandomUtils need to be in core, or can it get pushed down to math?



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795127#action_12795127 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

I am not very clear on what happens there when two words have the same hash. Aren't we losing a lot of information? The one I am proposing is going to do exact numbering of the features.

One thing my method suffers from is the addition of new data. It will take another couple of M/R jobs to create the new dictionary file while preserving the old ids. It's cumbersome but doable.
What happens in the Randomizer approach? Since you are fixing the feature set size, won't the hash ids also change when that feature set size increases?
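
Both concerns can be seen in a tiny Java demo, assuming a plain "hash code modulo vector size" scheme. That scheme is an assumption made for illustration and may differ from what the randomizers in MAHOUT-228 actually do; the class name HashIndexDemo and the one-letter terms are likewise made up.

```java
/** Hypothetical demo of collisions and of ids shifting when the vector size changes. */
public class HashIndexDemo {
    /** Slot of a term in a vector of the given size, using String.hashCode(). */
    public static int slot(String term, int size) {
        return Math.floorMod(term.hashCode(), size);
    }

    public static void main(String[] args) {
        String[] terms = {"a", "e"};
        for (int size : new int[] {4, 8}) {
            for (String term : terms) {
                // Print where each term lands for this vector size.
                System.out.println("size=" + size + " term=" + term + " slot=" + slot(term, size));
            }
        }
    }
}
```

With size 4 the two terms land in the same slot, so their counts are merged and cannot be told apart afterwards. With size 8 they separate, but one of them moves to a different slot, so vectors built with the two sizes are not comparable.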



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795320#action_12795320 ] 

Grant Ingersoll commented on MAHOUT-220:
----------------------------------------

FWIW, I'd say stuff that converts text, etc. to our internal representations belongs in the Utils module, where all the "helper" classes are.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795117#action_12795117 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

A caching layer is implemented in HbaseDatastore; you can set the cache size. Take a look at MAHOUT-124 for more details.

I am just porting the feature mapper and tfidf mapper over from the Bayes classifier common package to make the new text vectorizer. Take a look at them. It's a fully distributed way of doing tf.idf in 2 map/reduces.

For the vector converter, here is the idea in steps:

M/R1: Count frequencies of words tokenized using a configurable Lucene Analyzer.
SEQ1: Read the frequency list, prune words with frequency less than minSupport, and create the dictionary file (string => long) and the frequency file (string => long).
Do the map/reduce in chunks by keeping a block of the dictionary file in memory.
   repeat - M/R2: Run over the input documents, replacing each string with its integer id, and create (docid => sparsevector). This sparsevector has TF as its weights, but it is incomplete.
Now run a map/reduce over the incomplete sparse vectors. Group by docid. In the reducer, merge the sparse vectors.
The initial SparseVectors dataset is ready.

function multiplyIDF(){
M/R3: Calculate DF from the SparseVector dataset.
M/R4: Run over the SparseVector TF dataset and get IDF.
}


This is the first plan. At least when I finish. The second is to convert the document into a stream of integers using the dictionary file. Then subsequent functions can run M/R jobs to calculate LLR and make bigrams.

For this, the sparsevector merge MapReduce function should be generic enough.
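
The data flow of the steps above can be sketched on a single machine in plain Java. This is only an illustration under several assumptions: everything fits in memory, so the chunked dictionary pass and the sparse-vector merge are not needed; documents arrive already tokenized instead of going through a Lucene Analyzer; a Map<Integer, Double> stands in for SparseVector; and IDF is taken as ln(N / df), a formula the plan itself does not specify. All class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory walk-through of the dictionary / TF / DF / IDF steps. */
public class TfIdfPipelineSketch {

    /** M/R1 + SEQ1: count word frequencies, prune below minSupport, number the survivors. */
    public static Map<String, Integer> buildDictionary(List<List<String>> docs, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String token : doc) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minSupport) {
                dictionary.put(entry.getKey(), dictionary.size());
            }
        }
        return dictionary;
    }

    /** M/R2: replace each token with its integer id and accumulate TF; pruned words are dropped. */
    public static Map<Integer, Double> termFrequencies(List<String> doc, Map<String, Integer> dictionary) {
        Map<Integer, Double> tf = new TreeMap<>();
        for (String token : doc) {
            Integer id = dictionary.get(token);
            if (id != null) {
                tf.merge(id, 1.0, Double::sum);
            }
        }
        return tf;
    }

    /** M/R3: document frequency, i.e. the number of vectors in which each feature id occurs. */
    public static Map<Integer, Integer> documentFrequencies(List<Map<Integer, Double>> tfVectors) {
        Map<Integer, Integer> df = new TreeMap<>();
        for (Map<Integer, Double> vector : tfVectors) {
            for (Integer id : vector.keySet()) {
                df.merge(id, 1, Integer::sum);
            }
        }
        return df;
    }

    /** M/R4: weight each TF entry by idf = ln(numDocs / df), an assumed formula. */
    public static Map<Integer, Double> multiplyIdf(Map<Integer, Double> tf, Map<Integer, Integer> df, int numDocs) {
        Map<Integer, Double> weighted = new TreeMap<>();
        for (Map.Entry<Integer, Double> entry : tf.entrySet()) {
            double idf = Math.log((double) numDocs / df.get(entry.getKey()));
            weighted.put(entry.getKey(), entry.getValue() * idf);
        }
        return weighted;
    }
}
```

In the distributed version each method would correspond to one job, with the dictionary and the document frequencies written to files between jobs.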








[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795603#action_12795603 ] 

Grant Ingersoll commented on MAHOUT-220:
----------------------------------------

Yeah, I don't think Utils should need to depend on core.  I would think it should be:

Most things depend on Math (i.e. Vectors) including core and utils
Utils should be standalone tools for getting things into the appropriate mathematical representation which can be consumed by core, et al.

bq. Why is utils in its own module anyways, instead of just being in core?

I imagined utils to be the place where things that did useful ancillary tasks lived, such as converting ARFF to Mahout Vector or converting a Lucene index to Mahout Vector or converting a whole slew of raw text to Mahout Vector.  That way the core wouldn't need to be muddied by the various dependencies and could be as lightweight as possible.  So, over time, Utils may grow to have a dependency on Tika, for instance, or on a pipeline like OpenPipeline, but the core need not know anything about it.

Bleah, indeed!



[jira] Updated: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-220:
------------------------------

    Attachment: MAHOUT-BAYES.patch

This is the formatted, cleaned-up Mahout Bayes code, based on the MAHOUT-233 checkstyle and Eclipse formatter.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795050#action_12795050 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

Datastore is an interface which allows you to pick a named vector or a matrix and look up a cell.  For the Bayes classifier, the entire code is based on tokens and not SparseVectors, so the names of the matrix, the row and the column are up to the implementation. For the CBayes/Bayes algorithms, we have HBaseBayesDatastore.java and InMemoryBayesDatastore.java.

{code}
  double getWeight(String matrixName, String row, String column) throws InvalidDatastoreException;
  double getWeight(String vectorName, String index) throws InvalidDatastoreException;
{code}

For the sgd algorithm, I suggest you define your own matrix names, row indices and column indices, which your algorithm and datastore agree upon.

I know this creates a limitation: you can't use integer-based column and row names. Maybe we can parameterize it, OR change the Bayes package to use Vectors instead of the current string-token-based implementation.

I am currently writing a Map/Reduce job to convert text documents to vectors without relying on Lucene. Once that is done, I will overhaul the classifier package to use SparseVectors.

Before that, I need to know whether this patch is OK in terms of code style. I will then patch it and start with the enhancements.




[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796490#action_12796490 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

I am ready to commit the first cut, before moving on to more cleanups.

But this cleanup depends on the code style XML diff that I posted. Would anyone care to take a look at that?




[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795135#action_12795135 ] 

Ted Dunning commented on MAHOUT-220:
------------------------------------

{quote}
Robin: I am not very clear on what happens there when two words have the same hash. Aren't we losing a lot of information? The one I am proposing is going to do exact numbering of the features.
{quote}

That is the point of the "probes" parameter.  It allows for multiple hashing, as Jake is suggesting.  If you have, for example, 4 probes for each word, the chance of a complete collision is minuscule, and where there are collisions, the learning algorithm puts the weight on the non-colliding probes.

The extreme case is the DenseRandomizer.  Every term gets spread out to every feature so you have collisions on every term on every feature.  Because of the random weighting, you preserve enough information to allow effective learning.

See vowpal wabbit for a practical example.  They handle 10^12 (very) sparse features in memory and can learn at disk bandwidth in some applications.
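
The dense extreme described above can be illustrated with a short sketch. It assumes each term seeds a pseudo-random generator and contributes a Gaussian weight to every feature. `DenseProjectionSketch` and its methods are hypothetical names. This is a sketch of the general idea of a dense random projection, not the DenseRandomizer code itself.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical dense random projection; an illustration only.
final class DenseProjectionSketch {
  private final int numFeatures;

  DenseProjectionSketch(int numFeatures) {
    this.numFeatures = numFeatures;
  }

  // Every term gets a weight on every feature. Seeding the generator from the term
  // makes the weights reproducible without storing a dictionary.
  // Assumption: String.hashCode() is a good enough seed for a demo.
  double[] weightsFor(String term) {
    Random random = new Random(term.hashCode());
    double[] weights = new double[numFeatures];
    for (int i = 0; i < numFeatures; i++) {
      weights[i] = random.nextGaussian();
    }
    return weights;
  }

  // A document vector is the sum of the random weight vectors of its terms,
  // so every term collides with every other term on every feature.
  double[] vectorize(String... terms) {
    double[] vector = new double[numFeatures];
    for (String term : terms) {
      double[] weights = weightsFor(term);
      for (int i = 0; i < numFeatures; i++) {
        vector[i] += weights[i];
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    DenseProjectionSketch projection = new DenseProjectionSketch(8);
    double[] doc = projection.vectorize("naive", "bayes", "classifier");
    System.out.println(Arrays.toString(doc));
  }
}
```

No dictionary is stored anywhere: the same term always regenerates the same weights, which is what lets the feature space stay at a fixed size.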

{quote}
Jake: They might belong in a more general place, actually. If I'm going to use some of this stuff in the decompositions (although I'm not sure yet of the efficacy of the single hash for doing SVD), it should go somewhere in the math module.
{quote}

Should we generalize this concept to Vectorizer?  The dictionary approach can accept a previously computed dictionary (possibly augmenting it on the fly) and might be called a DictionaryVectorizer or WeightedDictionaryVectorizer.  At the level I have been working, the storage of the dictionary is an open question.  The randomizers could inherit from the same basic interface (or abstract class).




[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803466#action_12803466 ] 

Sean Owen commented on MAHOUT-220:
----------------------------------

I believe you're clear to commit that code style patch, and this, and close this up.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789903#action_12789903 ] 

Sean Owen commented on MAHOUT-220:
----------------------------------

I'm all for it. Do you need someone to commit?



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795551#action_12795551 ] 

Ted Dunning commented on MAHOUT-220:
------------------------------------

{quote}
FWIW, I'd say stuff that converts text, etc. to our internal representations belongs in the Utils module, where all the "helper" classes are. 
{quote}

That is what I would have liked to do.  We have some problems with dependencies.

Utils depends on core which depends on math.  But the classifier stuff depends on vectorization and is in core.  That means that either the classifier stuff has to be moved to utils (or a new module before utils) or vectorization moved down.  Math seems like a nice home for vectorization, but some kinds need random numbers and RandomUtils is in core.

Bleah.

My final answer is to put the vectorization in core, partly because I am too lazy to re-imagine the entire module structure.  This does raise the question of whether our modules are serving our needs or we are serving theirs.





[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795128#action_12795128 ] 

Jake Mannix commented on MAHOUT-220:
------------------------------------

Anil,

  Your map-reduces look great, that's the kind of thing I've done for this as well.  Good stuff.  

As for HBase and caching layers, I'd say it's still not fully scalable, as it's limited by whatever cache size you set and by your hit/miss ratio.  It seems the Datastore interface really is just a wrapper around Matrix and Vector, calling out to the entries.  Doing so in a random-access fashion seems like the reverse of the way I'd do it: pass the Algorithm *to* the Datastore, and have the computations be done where the data lives.  The Datastore iterates internally: in memory; or, if it knows it's backed by MySQL, say, it can batch calls to the db, pulling chunks into memory; or, if it's HDFS-backed, it can fire off a M/R job, etc.
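
One way to picture "pass the Algorithm to the Datastore" is a callback-style scan, sketched below. `CellVisitor`, `ScannableStore` and `InMemoryScannableStore` are hypothetical names invented for this illustration. A database-backed or HDFS-backed store would implement the same method with batched reads or a job, which is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PushComputationDemo {
  public static void main(String[] args) {
    InMemoryScannableStore store = new InMemoryScannableStore();
    store.put("weight", "free", "spam", 1.5);
    store.put("weight", "meeting", "ham", 2.5);

    // The "algorithm" is handed to the store; the store decides how to iterate.
    double[] total = new double[1];
    store.forEachCell("weight", (row, column, weight) -> total[0] += weight);
    System.out.println(total[0]);
  }
}

// Hypothetical callback: the algorithm as seen by the store.
interface CellVisitor {
  void visit(String row, String column, double weight);
}

// Hypothetical store contract: the store owns the iteration strategy.
interface ScannableStore {
  void forEachCell(String matrixName, CellVisitor visitor);
}

// In-memory variant: a plain sequential scan over nested maps.
final class InMemoryScannableStore implements ScannableStore {
  private final Map<String, Map<String, Map<String, Double>>> matrices = new LinkedHashMap<>();

  void put(String matrixName, String row, String column, double weight) {
    matrices.computeIfAbsent(matrixName, k -> new LinkedHashMap<>())
        .computeIfAbsent(row, k -> new LinkedHashMap<>())
        .put(column, weight);
  }

  @Override
  public void forEachCell(String matrixName, CellVisitor visitor) {
    Map<String, Map<String, Double>> rows = matrices.get(matrixName);
    if (rows == null) {
      return; // Unknown matrix: nothing to visit.
    }
    for (Map.Entry<String, Map<String, Double>> row : rows.entrySet()) {
      for (Map.Entry<String, Double> cell : row.getValue().entrySet()) {
        visitor.visit(row.getKey(), cell.getKey(), cell.getValue());
      }
    }
  }
}
```

The caller never asks for an individual cell, so a remote-backed implementation is free to stream or batch instead of answering one lookup per call.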



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795131#action_12795131 ] 

Jake Mannix commented on MAHOUT-220:
------------------------------------

bq. I am not very clear on what happens there when two words have the same hash. Aren't we losing a lot of information?

You can lose some information, sure, but there are *tons* of words, and you don't lose much information.  It is a probabilistic technique though.

Personally I prefer the multi-hash approach, because at least there I really believe the projection is preserving distances properly.  In the single-hash case, sometimes (i.e. for some single-word documents with different words), the collapse of distance is extreme (as Robin is alluding to).
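
The collapse for single-word documents can be shown with a toy example. The sketch below assumes a hypothetical polynomial hash whose multiplier depends on the probe number; `ProbeHashingSketch`, the multiplier scheme and the feature count of 16 are illustrative choices, not anything taken from the patch. Probe 0 uses the same recurrence as `String.hashCode()`, for which "Aa" and "BB" are a known collision.

```java
// Toy comparison of single-hash and multi-probe hashed vectors.
final class ProbeHashingSketch {
  private final int numFeatures;
  private final int numProbes;

  ProbeHashingSketch(int numFeatures, int numProbes) {
    this.numFeatures = numFeatures;
    this.numProbes = numProbes;
  }

  // Hypothetical hash family: probe p uses multiplier 31 + 2p.
  // With p = 0 this is the recurrence used by String.hashCode().
  int bucket(String word, int probe) {
    int multiplier = 31 + 2 * probe;
    int h = 0;
    for (int i = 0; i < word.length(); i++) {
      h = multiplier * h + word.charAt(i);
    }
    return Math.floorMod(h, numFeatures);
  }

  // Each word adds 1.0 to one bucket per probe.
  double[] vectorize(String... words) {
    double[] vector = new double[numFeatures];
    for (String word : words) {
      for (int probe = 0; probe < numProbes; probe++) {
        vector[bucket(word, probe)] += 1.0;
      }
    }
    return vector;
  }

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    ProbeHashingSketch single = new ProbeHashingSketch(16, 1);
    ProbeHashingSketch multi = new ProbeHashingSketch(16, 4);

    // Single hash: the two one-word documents land in the same bucket,
    // so the distance between them collapses to zero.
    System.out.println(distance(single.vectorize("Aa"), single.vectorize("BB")));

    // Four probes: the later probes disagree, so the documents stay distinguishable.
    System.out.println(distance(multi.vectorize("Aa"), multi.vectorize("BB")));
  }
}
```

With one probe the two documents are identical vectors; with four probes they still share the colliding bucket from probe 0, but the remaining probes keep them apart.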



[jira] Updated: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-220:
------------------------------

    Attachment: MAHOUT-BAYES.patch



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795114#action_12795114 ] 

Jake Mannix commented on MAHOUT-220:
------------------------------------

Robin,

  To really be scalable here, I'm down with the M/R approach for the classifiers.  The random-access nature of the current Datastore interface definitely seems limiting: even using HBase this way means we're making lots of remote calls, while a traditional Hadoop job would do the nice "put the code where the data lives" instead.

Switching over to use SparseVectors and doing things sequentially over the data set stored in SequenceFiles of them definitely seems the way I'd see this going.  Is that what your current hadoopified version of this does?

bq. I am currently writing a Map/Reduce job to convert text documents to vectors without relying on Lucene.

How are you doing this?  Is it a bag-of-words representation?  What form of tf are you using, and how are you putting in idf if it's fully distributed?



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794678#action_12794678 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

I will upload a new patch. I am reverting all these changes, will stick to the 80-column format and the new Lucene code formatter, and will start re-working from the latest trunk.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795139#action_12795139 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

The current Bayes implementation is an island. If you skim through the training mechanism, you will see it is heavily optimised (it uses the fewest map/reduces), and the kind of information I store in HBase and in memory is very specific to that paper.

First, there is the weight matrix, with features as rows, labels as columns, and the weight in each cell.
Secondly, there are the sums of the columns and rows, stored along with the weight matrix.
Then there are special rows containing the theta normalizer, the alpha smoothing value, etc.
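
As a rough picture of that layout, the sketch below keeps a feature-by-label weight table, the row and column sums, and a side map for named special values. `WeightTableSketch`, its method names and the special-value keys are hypothetical. How the real trainer computes or stores the theta normalizer and alpha is not reproduced here, so the numbers in `main` are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory picture of the layout described above.
final class WeightTableSketch {
  // feature (row) -> label (column) -> weight
  private final Map<String, Map<String, Double>> weights = new HashMap<>();
  private final Map<String, Double> featureSums = new HashMap<>(); // sum over each row
  private final Map<String, Double> labelSums = new HashMap<>();   // sum over each column
  private final Map<String, Double> specialValues = new HashMap<>();

  // Adds to one cell and keeps both marginal sums in step with it.
  void add(String feature, String label, double delta) {
    weights.computeIfAbsent(feature, k -> new HashMap<>()).merge(label, delta, Double::sum);
    featureSums.merge(feature, delta, Double::sum);
    labelSums.merge(label, delta, Double::sum);
  }

  double weight(String feature, String label) {
    Map<String, Double> row = weights.get(feature);
    return row == null ? 0.0 : row.getOrDefault(label, 0.0);
  }

  double featureSum(String feature) {
    return featureSums.getOrDefault(feature, 0.0);
  }

  double labelSum(String label) {
    return labelSums.getOrDefault(label, 0.0);
  }

  void setSpecial(String name, double value) {
    specialValues.put(name, value);
  }

  double special(String name) {
    return specialValues.getOrDefault(name, 0.0);
  }

  public static void main(String[] args) {
    WeightTableSketch table = new WeightTableSketch();
    table.add("free", "spam", 2.0);
    table.add("free", "ham", 0.5);
    table.add("meeting", "ham", 1.5);
    table.setSpecial("alpha", 1.0); // placeholder value, not taken from the thread

    System.out.println(table.featureSum("free"));
    System.out.println(table.labelSum("ham"));
  }
}
```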

As you can see, it is not really doing Bayes' rule; it is reproducing the math of the CBayes paper.  So I see no way of it directly using the sgd model.

We could have a Bayes algorithm implementation specific to the model you are training, if that's OK?



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795137#action_12795137 ] 

Jake Mannix commented on MAHOUT-220:
------------------------------------

bq. The extreme case is the DenseRandomizer. Every term gets spread out to every feature so you have collisions on every term on every feature. Because of the random weighting, you preserve enough information to allow effective learning.

Right, this is the use case in the stochastic decomposition case, cool.

bq. Should we generalize this concept to Vectorizer? The dictionary approach can accept a previously computed dictionary (possibly augmenting it on the fly) and might be called a DictionaryVectorizer or WeightedDictionaryVectorizer. At the level I have been working, the storage of the dictionary is an open question. The randomizers could inherit from the same basic interface (or abstract class).

Definitely.  



[jira] Resolved: (MAHOUT-220) Mahout Bayes Code cleanup

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil resolved MAHOUT-220.
-------------------------------

    Resolution: Fixed

Committed. 
