Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2010/09/25 15:32:51 UTC

Rewrite of CBayes classifier

Hi, I was in the middle of changing the classifier over to vectors when I
realized how radically it will change for people using it and how difficult
it is to fit the new interfaces Ted checked in. There are many components to
it, including the HBase stuff, which will take a lot of time to port. I
think it's best to rewrite it from scratch, keeping the old version so that
it won't break for existing users. If that is agreeable, I can complete a
new map/reduce + in-memory classifier in o.a.m.c.naivebayes fitting the new
interfaces, and deprecate the old bayes package. The new package won't have
the full feature set of the old one for the 0.4 release, but it will be
functional and, hopefully, future proof. Let me know your thoughts.

Robin

Re: Rewrite of CBayes classifier

Posted by Grant Ingersoll <gs...@apache.org>.
We're only on 0.4; I don't think you need to worry too much about back compat.


On Sep 25, 2010, at 9:32 AM, Robin Anil wrote:

> Hi, I was in the middle of changing the classifier over to vectors when I
> realized how radically it will change for people using it and how difficult
> it is to fit the new interfaces Ted checked in. There are many components
> to it, including the HBase stuff, which will take a lot of time to port. I
> think it's best to rewrite it from scratch, keeping the old version so that
> it won't break for existing users. If that is agreeable, I can complete a
> new map/reduce + in-memory classifier in o.a.m.c.naivebayes fitting the new
> interfaces, and deprecate the old bayes package. The new package won't have
> the full feature set of the old one for the 0.4 release, but it will be
> functional and, hopefully, future proof. Let me know your thoughts.
> 
> Robin

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8


Re: Rewrite of CBayes classifier

Posted by Robin Anil <ro...@gmail.com>.
The Bayes and CBayes classifiers are implemented as extensions of
AbstractVectorClassifier.

In the current version, I had to make the following hacks:


  /**
   * Returns the number of categories for the target variable.  A vector
   * classifier will encode its output using a zero-based 1 of numCategories
   * encoding.
   * @return The number of categories.
   */
  public abstract int numCategories();


This returns the number of labels.


  /**
   * Classify a vector returning a vector of numCategories-1 scores.  It is
   * assumed that the score for the missing category is one minus the sum of
   * the scores that are returned.
   *
   * Note that the missing score is the 0-th score.
   * @param instance  A feature vector to be classified.
   * @return  A vector of probabilities in 1 of n-1 encoding.
   */
  public abstract Vector classify(Vector instance);


The classifier scores are converted to 0-1 scores as ratios (clearly, they
are not probabilities). What should I do here?
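
To make the hack concrete, this is roughly what I do now (a sketch, not
final code; scoresForAllLabels is a hypothetical helper):

  @Override
  public Vector classify(Vector instance) {
    // one raw score per label (hypothetical helper, not a real method yet)
    Vector scores = scoresForAllLabels(instance);
    // divide by the total so the scores become 0-1 ratios; as noted above,
    // these are ratios, not true probabilities
    return scores.divide(scores.zSum());
  }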


  /**
   * Classifies a vector in the special case of a binary classifier where
   * <code>classify(Vector)</code> would return a vector with only one
   * element.  As such, using this method can avoid the allocation of a
   * vector.
   * @param instance   The feature vector to be classified.
   * @return The score for category 1.
   *
   * @see #classify(Vector)
   */
  public abstract double classifyScalar(Vector instance);

This throws an UnsupportedOperationException.



  /**
   * Classify a vector, but don't apply the inverse link function.  For
   * logistic regression and other generalized linear models, this is just
   * the linear part of the classification.
   * @param features  A feature vector to be classified.
   * @return  A vector of scores.  If transformed by the link function,
   * these will become probabilities.
   */
  public Vector classifyNoLink(Vector features);


This returns the score for each class label in the vector.


=====================


Since there is no online learning, do I need a batch trainer interface or
not?

Is there any interface defined for loading and saving a model?
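
If not, something minimal like this would do for me (a pure sketch, nothing
like it exists yet; the interface and method names are made up):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;

  public interface ModelPersistence {
    // write the trained model (weights, label dictionary) under output
    void save(Configuration conf, Path output) throws IOException;
    // read a previously saved model back for classification
    void load(Configuration conf, Path input) throws IOException;
  }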


More questions later

Re: Rewrite of CBayes classifier

Posted by Robin Anil <ro...@gmail.com>.
> I don't understand the value of MultilabelledVector

Currently I am planning a pure M/R trainer. Having a labelled/multilabelled
vector means I will be able to store the label as an int. I can pass in the
list of labels as a parameter and use the items, in order, to generate the
label ids.


> > I will now modify the DictionaryVectorizer to output the sub directory
> > chain as a label.
>
> If DictionaryVectorizer is 20 newsgroups specific, then that is OK.  In
> general, there will be too many documents to store one per file and it may
> be difficult to segregate data into one category per directory.


> > SequenceFileFromDirectory will create text sequence files with name as
> > "./Subdir1/Subdir2/file"
> > DictionaryVectorizer will run an extra job which takes the named vectors
> > it generates, and makes labelled vectors from them.
>
> I can't have an opinion here.
Re both of the above:

Yes. So I will drop this preprocessing and let the user write their own
preprocessing. But to complete an end-to-end example from a directory of
documents (a.k.a. 20 newsgroups), I will write the preprocessing as an M/R
job in examples.


>
> > The question is the handling of the LabelDictionary. This is a messy way
> > of handling this. The other way is to let naivebayes read data as
> > NamedVectors and take care of tokenizing and extracting the label from
> > the name (two choices)
>
> My big questions center on how this might be used in a production
> setting.  In that case, the assumption of input in files breaks down
> because the user will probably have their own intricate input setup.  If
> we assume that the input will be in the form of hashed feature vectors,
> then the following outline seems reasonable to me:
>
>    algorithm = new NaiveBayes(...)
>
>    for all training examples {
>       int actual = target variable value
>       Vector features = vectorize example
>       algorithm.train(actual, features)   // secretly save vector as appropriate
>    }
>
This isn't scalable: a single process writing files to the cluster. There
could be many ways of forming the input data.

1) like above
2) the user writes an M/R job over their input data format and writes the
data in the required input format, i.e. tf-vectors or a sequence file of
text. The tfidf job will then execute over this (either from tf vectors or
from text) to create the tfidf vectors, and then the Bayes trainer will
execute using those. The classifier will use the dictionary to map strings
to ids and use a dot product to classify
3) the user writes an M/R job over their input data format and uses hashed
encoders to create vectors. The Bayes trainer executes over the generated
file. The hashed encoders are initialized in the classifier in exactly the
same way, and the classifier classifies (see the sketch below)
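
For example, the map side of option 3 could look roughly like this (a
sketch; TextValueEncoder and RandomAccessSparseVector exist today, while
documentText and the emit step are placeholders):

  TextValueEncoder encoder = new TextValueEncoder("body");
  Vector instance = new RandomAccessSparseVector(1 << 20);  // 2^20 hashed features
  encoder.addText(documentText);   // accumulate term counts for one record
  encoder.flush(1.0, instance);    // write the hashed features into the vector
  // emit (label, new VectorWritable(instance)) for the Bayes trainer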


Robin

Re: Rewrite of CBayes classifier

Posted by Ted Dunning <te...@gmail.com>.
I don't think I understand the questions entirely.  What you say starts out
easy, but then gets strange (to me).

On Sat, Sep 25, 2010 at 2:53 PM, Robin Anil <ro...@gmail.com> wrote:

> Can I safely assume the input to the naive bayes is a sequence file with
> Text as label (key) and VectorWritable as an instance (value)? Or should
> it be a dummy key and MultilabelledVector? Have we closed the discussions
> about it?
>

Hmm.... key=label, value=Vector sounds really good to me.

I don't understand the value of MultilabelledVector

> I will now modify the DictionaryVectorizer to output the sub directory
> chain as a label.

If DictionaryVectorizer is 20 newsgroups specific, then that is OK.  In
general, there will be too many documents to store one per file and it may
be difficult to segregate data into one category per directory.

> SequenceFileFromDirectory will create text sequence files with name as
> "./Subdir1/Subdir2/file"
> DictionaryVectorizer will run an extra job which takes the named vectors
> it generates, and makes labelled vectors from them.

I can't have an opinion here.


>
> The question is the handling of the LabelDictionary. This is a messy way
> of handling this. The other way is to let naivebayes read data as
> NamedVectors and take care of tokenizing and extracting the label from the
> name (two choices)

My big questions center on how this might be used in a production setting.
In that case, the assumption of input in files breaks down because the user
will probably have their own intricate input setup.  If we assume that the
input will be in the form of hashed feature vectors, then the following
outline seems reasonable to me:

    algorithm = new NaiveBayes(...)

    for all training examples {
       int actual = target variable value
       Vector features = vectorize example
       algorithm.train(actual, features)   // secretly save vector as appropriate
    }
    algorithm.close()                      // map-reduce actually happens here

My question to you is this: how does this outline mesh with what you are
saying?  Where do you think the IDF would happen?  What role does the
vector dictionary have here?

Re: Rewrite of CBayes classifier

Posted by Robin Anil <ro...@gmail.com>.
Can I safely assume the input to the naive bayes is a sequence file with
Text as label (key) and VectorWritable as an instance (value)? Or should it
be a dummy key and MultilabelledVector? Have we closed the discussions
about it?

I will now modify the DictionaryVectorizer to output the sub directory chain
as a label.

SequenceFileFromDirectory will create text sequence files with name as
"./Subdir1/Subdir2/file"
DictionaryVectorizer will run an extra job which takes the named vectors it
generates, and makes labelled vectors from them.

The question is the handling of the LabelDictionary. This is a messy way of
handling it. The other way is to let naivebayes read data as NamedVectors
and take care of tokenizing and extracting the label from the name (two
choices: String, or use a Dictionary lookup to convert it to integers)

Thoughts?
Robin


On Sun, Sep 26, 2010 at 2:57 AM, Ted Dunning <te...@gmail.com> wrote:

> Log normalization is already in TextValueEncoder
>
> On Sat, Sep 25, 2010 at 2:20 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > Ok. I am going ahead with this. I would ask you to add the
> > logNormalization per document as an option in SGD. Jrennie's paper
> > mentions how it improves accuracy for text. I don't know how it affects
> > SGD-type learning.
> >
>

Re: Rewrite of CBayes classifier

Posted by Ted Dunning <te...@gmail.com>.
Log normalization is already in TextValueEncoder

On Sat, Sep 25, 2010 at 2:20 PM, Robin Anil <ro...@gmail.com> wrote:

> Ok. I am going ahead with this. I would ask you to add the logNormalization
> per document as an option in SGD. Jrennie's paper mentions how it improves
> accuracy for text. I don't know how it affects SGD-type learning.
>

Re: Rewrite of CBayes classifier

Posted by Robin Anil <ro...@gmail.com>.
> Reasonable approach.  With the sgd code, I avoid an IDF computation by
> using an annealed per-term feature learning rate.

Ok. I am going ahead with this. I would ask you to add the logNormalization
per document as an option in SGD. Jrennie's paper mentions how it improves
accuracy for text. I don't know how it affects SGD-type learning.

Robin

Re: Rewrite of CBayes classifier

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Sep 25, 2010 at 1:57 PM, Robin Anil <ro...@gmail.com> wrote:

> I currently call it in the tf job or idf job at the end when merging the
> partial vectors. This throws away the feature counting and tfidf jobs in
> naive bayes. Now all I need is to port the weight summer and weight
> normalization jobs. Just two jobs to create the model from tfidf vectors.
>

Reasonable approach.  With the sgd code, I avoid an IDF computation by using
an annealed per-term feature learning rate.

If this annealing goes with 1/n, where n is the number of instances seen so
far, the final sum is ~ log N, where N is the total number of occurrences.
That saves a pass through the data which, when you are doing online
learning, is critical.
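
Spelling that sum out (just the standard harmonic series bound), a per-term
rate of 1/n accumulates, after N occurrences,

  sum_{n=1}^{N} 1/n = ln N + gamma + O(1/N)  ~  log N

so the per-term weight behaves like an IDF-style log damping without ever
needing a counting pass to learn N up front.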


>
> Or
>
> Naive bayes can generate the model from the vectors generated by the
> hashed feature vectorizer.
>
> Multi-field documents can generate a word feature = Field + Word, and use
> the dictionary vectorizer or hashed feature vectorizer to convert that to
> vectors. I say let there be collisions. Since increasing the number of
> bits can decrease collisions, VW takes that approach. Let the people who
> worry increase the number of bits :)
>

I also provide the ability to probe the vector more than once.  This makes
smaller vectors much more usable
in the same way that Bloom filters can use smaller, over-filled bit vectors.
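
For example (a sketch with the encoders in org.apache.mahout.vectors):

  FeatureVectorEncoder encoder = new StaticWordValueEncoder("word");
  encoder.setProbes(2);              // each feature hashes to two locations
  encoder.addToVector("mahout", v);  // sets both probed locations in v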

In production, we clearly see a few cases of collisions when inspecting the
dissected models, but very rarely.

Re: Rewrite of CBayes classifier

Posted by Robin Anil <ro...@gmail.com>.
On Sun, Sep 26, 2010 at 2:05 AM, Ted Dunning <te...@gmail.com> wrote:

> This is a slightly tricky question when it comes to hashed feature vectors
> containing data from several fields.  Especially in cases with very large
> feature sets, collisions within a single document are probable even with
> large feature vectors.
>

I agree, that's why I am going to log normalize only in the dictionary
vectorizer. The function can still exist in AbstractVector:


  public Vector logNormalize() {
    // default: log base 2, scaled by the L2 norm of the vector
    return logNormalize(2, Math.sqrt(dotSelf()));
  }

  public Vector logNormalize(double power) {
    return logNormalize(power, norm(power));
  }

  public Vector logNormalize(double power, double normLength) {
    // we can special case certain powers
    if (Double.isInfinite(power) || power <= 1.0) {
      throw new IllegalArgumentException("Power must be > 1 and < infinity");
    } else {
      double denominator = normLength * Math.log(power);
      Vector result = like().assign(this);
      Iterator<Element> iter = result.iterateNonZero();
      while (iter.hasNext()) {
        Element element = iter.next();
        // log_power(1 + element) scaled by the norm of the whole vector
        element.set(Math.log(1 + element.get()) / denominator);
      }
      return result;
    }
  }
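
So when merging partial vectors it becomes a one-liner (sketch):

  Vector tf = ...;                        // term-frequency vector
  Vector normalized = tf.logNormalize();  // log2(1 + tf_i) / ||tf||_2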

I currently call it in the tf job or idf job at the end when merging the
partial vectors. This throws away the feature counting and tfidf jobs in
naive bayes. Now all I need is to port the weight summer and weight
normalization jobs. Just two jobs to create the model from tfidf vectors.

Or:

Naive bayes can generate the model from the vectors generated by the hashed
feature vectorizer.

Multi-field documents can generate a word feature = Field + Word, and use
the dictionary vectorizer or hashed feature vectorizer to convert that to
vectors. I say let there be collisions. Since increasing the number of bits
can decrease collisions, VW takes that approach. Let the people who worry
increase the number of bits :)
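
Concretely, for the Field + Word idea (a sketch using the existing
encoders; the field names here are made up):

  // one encoder per field: the field name is mixed into the hash, so the
  // same word in "title" and "body" lands in different (probable) slots
  FeatureVectorEncoder title = new StaticWordValueEncoder("title");
  FeatureVectorEncoder body = new StaticWordValueEncoder("body");
  title.addToVector(word, v);
  body.addToVector(word, v);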

Robin



> On Sat, Sep 25, 2010 at 12:46 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > Rewrite Question
> >
> > A key thing that improves the accuracy of naive bayes over text is the
> > normalization of the TF vector (V):
> >
> > new V_i = log(1 + V_i) / sqrt(Sigma_k(V_k^2))
> >
> > AbstractVector already does the L_p norm; does it make sense to add one
> > function to do the above normalization? Say logNormalize(double x). I
> > will be adding this to the PartialVectorMerger (in DictionaryVectorizer).
> > So two choices: I can do this in the Vectorizer, or the Vectorizer can
> > call this function?
> >
> >
> >
> > Robin

Re: Rewrite of CBayes classifier

Posted by Ted Dunning <te...@gmail.com>.
This is a slightly tricky question when it comes to hashed feature vectors
containing data from several fields.  Especially in cases with very large
feature sets, collisions within a single document are probable even with
large feature vectors.

I have toyed with several approaches:

- one way is to count the words in the document and only insert log(TF) in a
cleanup phase.  This leads to complexity when you don't get the entire
document at once, but instead get it, say, a line at a time.  Concatenating
the lines in memory and then converting them all at once absolutely kills
performance.  org.apache.mahout.vectors.TextValueEncoder takes this approach
and provides addText and flush methods.  The addToVector method combines
these for convenience if you do happen to have the whole thing handy.

- another way is to convert the document progressively into a single vector
and then take the log of that vector before adding it to the real feature
vector.  This avoids the counter table in the value encoder, but goes pretty
wrong, if rarely, in the face of collisions.  I didn't like this approach,
but it would be easy to try and I didn't have specific complaints, just a
grumbly feeling.

- one way that will work for 20 newsgroups as handled by the current naive
bayes code, but will not work in general, is to just accumulate data into a
feature vector and then do assign(Functions.LOG) to that feature vector.
This is like the first half of the second approach without the second half.
I don't feel that this is a good approach at all, even if it would be faster
than either of the first two approaches.  The major problem is that it makes
multi-field documents impossible to think about.
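
In code, the first approach looks roughly like this (a sketch; addText and
flush are the real TextValueEncoder methods, the rest is made up):

  TextValueEncoder enc = new TextValueEncoder("body");
  for (String line : lines) {
    enc.addText(line);   // just counts words, no vector writes yet
  }
  enc.flush(1.0, v);     // cleanup phase: inserts the log(TF) weights into v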


On Sat, Sep 25, 2010 at 12:46 PM, Robin Anil <ro...@gmail.com> wrote:

> Rewrite Question
>
> A key thing that improves the accuracy of naive bayes over text is the
> normalization of the TF vector (V):
>
> new V_i = log(1 + V_i) / sqrt(Sigma_k(V_k^2))
>
> AbstractVector already does the L_p norm; does it make sense to add one
> function to do the above normalization? Say logNormalize(double x). I will
> be adding this to the PartialVectorMerger (in DictionaryVectorizer). So two
> choices: I can do this in the Vectorizer, or the Vectorizer can call this
> function?
>
>
>
> Robin

Re: Rewrite of CBayes classifier

Posted by Robin Anil <ro...@gmail.com>.
Rewrite Question

A key thing that improves the accuracy of naive bayes over text is the
normalization of the TF vector (V):

new V_i = log(1 + V_i) / sqrt(Sigma_k(V_k^2))

AbstractVector already does the L_p norm; does it make sense to add one
function to do the above normalization? Say logNormalize(double x). I will
be adding this to the PartialVectorMerger (in DictionaryVectorizer). So two
choices: I can do this in the Vectorizer, or the Vectorizer can call this
function?



Robin


On Sat, Sep 25, 2010 at 10:22 PM, Sean Owen <sr...@gmail.com> wrote:

> I think it's fine to do a rewrite at this stage. 0.5 sounds like a
> nice goal. Just recall that aspects of this will be 'in print' soon so
> yeah you want to a) plan to deprecate rather than remove the old code
> for some time, b) make the existing code "forwards compatible" with
> what you'll do next while you have the chance!

Re: Rewrite of CBayes classifier

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  +1

On 9/26/10 9:54 PM, Drew Farris wrote:
> On Sun, Sep 26, 2010 at 7:48 PM, Sean Owen <sr...@gmail.com> wrote:
>> No, I just meant it would be better to deprecate than remove it. And
>> deprecation can come later. Better still to make it all as
>> backwards-compatible as possible.
>>
> It would be great to deprecate the existing bayes code >post< 0.4 --
> the new vector bayes code is certainly coming into being very close to
> the 0.4 release. How about considering the new vector-based bayes code
> experimental for the 0.4 release, with the goal of fully certifying it
> for the 0.5 release and deprecating the existing bayes in 0.5.
>


Re: Rewrite of CBayes classifier

Posted by Sean Owen <sr...@gmail.com>.
That was my intent, indeed.

BTW, we still want to cut 0.4 soon. The issue list is shrinking, but there
are still about 10 out there. Let's lean more and more towards pushing
issues out if they're not done in a day or two, and lean towards tying up
any loose ends now.

(That is also to say, I don't think new classifiers belong in 0.4.)

On Mon, Sep 27, 2010 at 2:54 AM, Drew Farris <dr...@gmail.com> wrote:
> On Sun, Sep 26, 2010 at 7:48 PM, Sean Owen <sr...@gmail.com> wrote:
>> No, I just meant it would be better to deprecate than remove it. And
>> deprecation can come later. Better still to make it all as
>> backwards-compatible as possible.
>>
>
> It would be great to deprecate the existing bayes code >post< 0.4 --
> the new vector bayes code is certainly coming into being very close to
> the 0.4 release. How about considering the new vector-based bayes code
> experimental for the 0.4 release, with the goal of fully certifying it
> for the 0.5 release and deprecating the existing bayes in 0.5.
>

Re: Rewrite of CBayes classifier

Posted by Robin Anil <ro...@gmail.com>.
This was just a "thought", once I finish a working version of the New Bayes.
Need everyone to be comfortable with deprecation, but Drew's plan sounds
perfect

Robin

On Mon, Sep 27, 2010 at 7:24 AM, Drew Farris <dr...@gmail.com> wrote:

> On Sun, Sep 26, 2010 at 7:48 PM, Sean Owen <sr...@gmail.com> wrote:
> > No, I just meant it would be better to deprecate than remove it. And
> > deprecation can come later. Better still to make it all as
> > backwards-compatible as possible.
> >
>
> It would be great to deprecate the existing bayes code >post< 0.4 --
> the new vector bayes code is certainly coming into being very close to
> the 0.4 release. How about considering the new vector-based bayes code
> experimental for the 0.4 release, with the goal of fully certifying it
> for the 0.5 release and deprecating the existing bayes in 0.5.
>

Re: Rewrite of CBayes classifier

Posted by Drew Farris <dr...@gmail.com>.
On Sun, Sep 26, 2010 at 7:48 PM, Sean Owen <sr...@gmail.com> wrote:
> No, I just meant it would be better to deprecate than remove it. And
> deprecation can come later. Better still to make it all as
> backwards-compatible as possible.
>

It would be great to deprecate the existing bayes code >post< 0.4 --
the new vector bayes code is certainly coming into being very close to
the 0.4 release. How about considering the new vector-based bayes code
experimental for the 0.4 release, with the goal of fully certifying it
for the 0.5 release and deprecating the existing bayes in 0.5.

Re: Rewrite of CBayes classifier

Posted by Sean Owen <sr...@gmail.com>.
No, I just meant it would be better to deprecate than remove it. And
deprecation can come later. Better still to make it all as
backwards-compatible as possible.

On Mon, Sep 27, 2010 at 12:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Sep 25, 2010, at 12:52 PM, Sean Owen wrote:
>
>> I think it's fine to do a rewrite at this stage. 0.5 sounds like a
>> nice goal. Just recall that aspects of this will be 'in print' soon so
>> yeah you want to a) plan to deprecate rather than remove the old code
>> for some time,
>
> As much as I look forward to the book, do you really feel it is a good idea to deprecate code at this stage of the game?
>
> -Grant

Re: Rewrite of CBayes classifier

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 25, 2010, at 12:52 PM, Sean Owen wrote:

> I think it's fine to do a rewrite at this stage. 0.5 sounds like a
> nice goal. Just recall that aspects of this will be 'in print' soon so
> yeah you want to a) plan to deprecate rather than remove the old code
> for some time,

As much as I look forward to the book, do you really feel it is a good idea to deprecate code at this stage of the game?

-Grant

Re: Rewrite of CBayes classifier

Posted by Ted Dunning <te...@gmail.com>.
The stuff in the book is largely just the command line stuff and doesn't go
into much detail about API access to the NB stuff.

My own lamentable tendency would be to prototype the algorithm using Plume,
partly to push Plume development and partly to see if that helps build
API-oriented map-reduce components.  I say lamentable because a
straightforward implementation would definitely be done sooner.

On Sat, Sep 25, 2010 at 9:52 AM, Sean Owen <sr...@gmail.com> wrote:

> I think it's fine to do a rewrite at this stage. 0.5 sounds like a
> nice goal. Just recall that aspects of this will be 'in print' soon so
> yeah you want to a) plan to deprecate rather than remove the old code
> for some time, b) make the existing code "forwards compatible" with
> what you'll do next while you have the chance!

Re: Rewrite of CBayes classifier

Posted by Sean Owen <sr...@gmail.com>.
I think it's fine to do a rewrite at this stage. 0.5 sounds like a
nice goal. Just recall that aspects of this will be 'in print' soon so
yeah you want to a) plan to deprecate rather than remove the old code
for some time, b) make the existing code "forwards compatible" with
what you'll do next while you have the chance!

On Sat, Sep 25, 2010 at 2:32 PM, Robin Anil <ro...@gmail.com> wrote:
> Hi, I was in the middle of changing the classifier over to vectors when I
> realized how radically it will change for people using it and how difficult
> it is to fit the new interfaces Ted checked in. There are many components
> to it, including the HBase stuff, which will take a lot of time to port. I
> think it's best to rewrite it from scratch, keeping the old version so that
> it won't break for existing users. If that is agreeable, I can complete a
> new map/reduce + in-memory classifier in o.a.m.c.naivebayes fitting the new
> interfaces, and deprecate the old bayes package. The new package won't have
> the full feature set of the old one for the 0.4 release, but it will be
> functional and, hopefully, future proof. Let me know your thoughts.
>
> Robin
>