Posted to user@mahout.apache.org by David Rahman <dr...@googlemail.com> on 2011/11/03 15:25:29 UTC

Re: confidence values of one (or more) feature(s)

Me again,

can someone point me in the right direction? How can I access these features?
I looked into the summary(int n) method located in
org.apache.mahout.classifier.sgd.ModelDissector.java, but somehow I don't
understand how it works.

Could someone explain to me how it works? As I understand it, it returns
just the max-value of a feature.

Thanks and regards,
David

2011/10/20 David Rahman <dr...@googlemail.com>

> Hi,
>
> how can I access the confidence values of one (or more) feature(s) with
> their probabilities?
>
> In the 20Newsgroup example, there is the dissect method, which uses
> summary(int n) to return the n most important features with their
> weights. I also want the features which are placed second or third (or
> more). How can I access those?
>
> Regards,
> David
>
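Conceptually, summary(int n) in ModelDissector keeps a ranked list of the n highest-weight features, so the second- and third-ranked features are simply the next entries of that list. A minimal plain-Java sketch of the idea (hypothetical feature names and weights, not Mahout's actual data structures):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopFeatures {
    // Return the n features with the largest absolute weight, best first.
    static List<Map.Entry<String, Double>> topN(Map<String, Double> weights, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(weights.entrySet());
        entries.sort((a, b) -> Double.compare(Math.abs(b.getValue()),
                                              Math.abs(a.getValue())));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Double> w = new HashMap<>();
        w.put("social", 1.0);
        w.put("social media", 0.8);
        w.put("facebook", 0.5);
        w.put("weather", 0.1);
        // The most important feature is entry 0; the second and third
        // most important are just the next entries of the sorted list.
        for (Map.Entry<String, Double> e : topN(w, 3)) {
            System.out.println(e.getKey() + "  " + e.getValue());
        }
    }
}
```

Asking for a larger n is all it takes to see the lower-ranked features; nothing is recomputed, only more of the sorted list is returned.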

Re: confidence values of one (or more) feature(s)

Posted by Ted Dunning <te...@gmail.com>.
Here are some hints.

https://cwiki.apache.org/MAHOUT/how-to-contribute.html

It is really easy and we would be happy to help.

On Thu, Nov 3, 2011 at 1:48 PM, David Rahman <dr...@googlemail.com>wrote:

> Never done that before, but I will look into it. As an alternative I could
> send it to your email. But first I have to implement it successfully.
>
> Thanks again and regards,
> David
>
> 2011/11/3 Ted Dunning <te...@gmail.com>
>
> > If you do get to that, could you write up a JIRA and attach a patch?
> >
> > On Thu, Nov 3, 2011 at 1:33 PM, David Rahman <drahman1985@googlemail.com
> > >wrote:
> >
> > > Thank you Ted,
> > >
> > > I will test the methods next week, when I'm back in the office and let
> > you
> > > know how it went.
> > >
> > > Thank you and best regards,
> > > David
> > >
> > > 2011/11/3 Ted Dunning <te...@gmail.com>
> > >
> > > > OK.
> > > >
> > > > So the simplest design in Mahout terms is a binary classifier for
> each
> > > > keyword (if the keywords are not mutually exclusive).  If you can
> > define
> > > a
> > > > useful ordering for terms or have some logical entailment, you may
> want
> > > to
> > > > allow the presence of some terms to be features for certain other
> > terms.
> > > >
> > > > So the question boils down to how to ask a binary logistic regression
> > how
> > > > it came to its conclusion.
> > > >
> > > > You are correct to look to the model dissector for the function you
> > want,
> > > > but you will have to call it in a slightly unusual way because it
> is
> > > > really intended to describe a model rather than a single decision.
>  The
> > > > logistic regression functions in Mahout don't actually expose quite
> as
> > > much
> > > > information as you need for this, but if you add this method, you
> > should
> > > > get the basic information you need:
> > > >
> > > >        /**
> > > >   * Return the element-wise product of the feature vector versus each
> > > > column
> > > >   * of the beta matrix.  This can then be used to extract the most
> > > > interesting
> > > >   * features for a decision for each alternative output.
> > > >   * @param instance  A feature vector
> > > >   * @return   A matrix like beta but with each column multiplied by
> > > > instance.
> > > >   */
> > > >  public Matrix explain(Vector instance) {
> > > >    regularize(instance);
> > > >    Matrix r = beta.like().assign(beta);
> > > >    for (int column = 0; column < r.columnSize(); column++) {
> > > >      r.viewColumn(column).assign(instance, Functions.MULT);
> > > >    }
> > > >    return r;
> > > >  }
> > > >
> > > >
> > > > Then to explain your binary model, you probably want some code like
> > this:
> > > >
> > > >   Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
> > > >   Vector instance = encode(data, traceDictionary);
> > > >   Matrix b = model.explain(instance);
> > > >
> > > >   ModelDissector md = new ModelDissector();
> > > >   // get positive terms
> > > >   md.update(b.viewColumn(0), traceDictionary, model);
> > > >   // scan through the top terms
> > > >   ...
> > > >
> > > >   md = new ModelDissector();
> > > >   md.update(b.viewColumn(0).assign(Functions.NEGATE), traceDictionary,
> > > > model);
> > > >   // scan through the most negative terms
> > > >   ...
> > > >
> > > > Note that all of this code is untested and I could be out to lunch
> > here.
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Nov 3, 2011 at 12:19 PM, David Rahman <
> > > drahman1985@googlemail.com
> > > > >wrote:
> > > >
> > > > > Hi Ted,
> > > > >
> > > > > I want to have the model explain why it classified documents in a
> > > certain
> > > > > way. That should be enough at first.
> > > > >
> > > > > I want to classify documents, each document has a corresponding set
> > of
> > > > > keywords. The model should be able to classify unknown documents
> and
> > > > > provide a number of suggestions of keywords. Later on it should be
> > > > possible
> > > > > to build a search term recommender for a search engine with
> > classified
> > > > > documents as a basis.
> > > > >
> > > > > At first we wanted to use the lucene data, but the existing data is
> > > built
> > > > > with an older lucene version, so the data is provided in xml, for
> > now.
> > > > It's
> > > > > like the wikipedia example, only with more possible keywords.
> > > > >
> > > > > Hope it's understandable.
> > > > >
> > > > > Thanks for your endurance and regards,
> > > > > David
> > > > >
> > > > > 2011/11/3 Ted Dunning <te...@gmail.com>
> > > > >
> > > > > > I am sorry for being dense, but I don't really understand what
> you
> > > are
> > > > > > trying to do.
> > > > > >
> > > > > > As I see it,
> > > > > >
> > > > > > - the input is documents
> > > > > >
> > > > > > - the output is a category
> > > > > >
> > > > > > You want one or more of the following,
> > > > > >
> > > > > > - to have the model explain why it classified documents a certain
> > way
> > > > > >
> > > > > > or
> > > > > >
> > > > > > - to classify non-document phrases a certain way
> > > > > >
> > > > > > or
> > > > > >
> > > > > > - to have the model show its internal structure to you
> > > > > >
> > > > > > or
> > > > > >
> > > > > > - something else entirely
> > > > > >
> > > > > > Can you say what you want in these terms?
> > > > > >
> > > > > > On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <
> > > > drahman1985@googlemail.com
> > > > > > >wrote:
> > > > > >
> > > > > > > Hi Ted,
> > > > > > >
> > > > > > > thank you for the explanation.
> > > > > > > For example imagine a term cloud, in which terms are presented.
> > > Some
> > > > > > terms
> > > > > > > are bigger than the others, because they are more likely than the
> > other
> > > > > > terms. I
> > > > > > > would need those results for analysis. We want to compare
> > different
> > > > > > > ML-algorithms and methods and/or combinations of them. And
> first
> > I
> > > > have
> > > > > > to
> > > > > > > gain some basic knowledge about Mahout.
> > > > > > >
> > > > > > > For example, when I take the word 'social' as input I'd like to
> > > have
> > > > > that
> > > > > > > result:
> > > > > > >
> > > > > > > social                    1.0
> > > > > > > social media           0.8
> > > > > > > social networking    0.65
> > > > > > > social news            0.6
> > > > > > > facebook                0.5
> > > > > > > ...
> > > > > > >
> > > > > > > (ignore those values, they're not correct, but it should show
> what I
> > > > need)
> > > > > > >
> > > > > > > The 20Newsgroup-example shows with the summary(int n) method
> the
> > > most
> > > > > > > likely categorisation of a term (--> the most important
> > feature). I
> > > > > would
> > > > > > > like to have a list with the second, third, and so on important
> > > > > feature.
> > > > > > I
> > > > > > > imagine, while computing the features, only the most important
> ones
> > > are
> > > > > > added
> > > > > > > to the list and the less important features are rejected.
> > > > > > >
> > > > > > > Thanks and regards,
> > > > > > > David
> > > > > > >
> > > > > > > 2011/11/3 Ted Dunning <te...@gmail.com>
> > > > > > >
> > > > > > > > There are no confidence values per se in the models computed
> by
> > > > > Mahout
> > > > > > at
> > > > > > > > this time.
> > > > > > > >
> > > > > > > > There are several issues here,
> > > > > > > >
> > > > > > > > 1) Naive Bayes doesn't have such a concept.  'Nuff said
> there.
> > > > > > > >
> > > > > > > > 2) SGD logistic regression could compute confidence
> intervals,
> > > > but I
> > > > > > am
> > > > > > > > not quite sure how to do that with stochastic gradient
> descent.
> > > > > > > >
> > > > > > > > 3) in most uses of Mahout's logistic regression, the issues
> are
> > > > data
> > > > > > size
> > > > > > > > and feature set size.  Confidence values are typically used
> for
> > > > > > selecting
> > > > > > > > features which is typically not a viable strategy for
> problems
> > > with
> > > > > > very
> > > > > > > > large feature sets.  That is what the L1 regularization is
> all
> > > > about.
> > > > > > > >
> > > > > > > > 4) with an extremely large number of features, the noise on
> > > confidence
> > > > > > > > intervals makes them very hard to understand
> > > > > > > >
> > > > > > > > 5) with hashed features and feature collisions it is hard
> > enough
> > > to
> > > > > > > > understand which feature is doing what, much less what the
> > > > confidence
> > > > > > > > interval means.
> > > > > > > >
> > > > > > > > Can you say more about your problem?  Is it small enough to
> use
> > > > > > bayesglm
> > > > > > > in
> > > > > > > > R?
> > > > > > > >
> > > > > > > > On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <
> > > > > > drahman1985@googlemail.com
> > > > > > > > >wrote:
> > > > > > > >
> > > > > > > > > Me again,
> > > > > > > > >
> > > > > > > > > can someone point me in the right direction? How can I access
> > these
> > > > > > > features?
> > > > > > > > > I looked into the summary(int n) method located in
> > > > > > > > > org.apache.mahout.classifier.sgd.ModelDissector.java, but
> > > > somehow I
> > > > > > > don't
> > > > > > > > > understand how it works.
> > > > > > > > >
> > > > > > > > > Could someone explain to me how it works? As I understand
> it,
> > > it
> > > > > > > returns
> > > > > > > > > just the max-value of a feature.
> > > > > > > > >
> > > > > > > > > Thanks and regards,
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > 2011/10/20 David Rahman <dr...@googlemail.com>
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > how can I access the confidence values of one (or more)
> > > > > feature(s)
> > > > > > > with
> > > > > > > > > > its possibilities?
> > > > > > > > > >
> > > > > > > > > > In the 20Newsgroup-example, there is the dissect method,
> > > within
> > > > > > there
> > > > > > > > is
> > > > > > > > > > used summary(int n), which returns the n most important
> > > > features
> > > > > > with
> > > > > > > > > their
> > > > > > > > > > weights. I also want the features which are placed second
> > or
> > > > > third
> > > > > > > (or
> > > > > > > > > > more). How can I access those?
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
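The explain method sketched in this thread is just an element-wise product of the instance vector with each column of the weight matrix beta; each entry of the resulting column is that feature's additive contribution to the logit for that category. A self-contained plain-Java illustration of the same arithmetic (made-up weights, not the Mahout API):

```java
public class Explain {
    // contributions[i] = betaColumn[i] * instance[i] for one output column:
    // each entry is that feature's additive contribution to the logit.
    static double[] explainColumn(double[] betaColumn, double[] instance) {
        double[] r = new double[betaColumn.length];
        for (int i = 0; i < betaColumn.length; i++) {
            r[i] = betaColumn[i] * instance[i];
        }
        return r;
    }

    public static void main(String[] args) {
        double[] beta = {2.0, -1.5, 0.25};  // learned weights for one category
        double[] x = {1.0, 1.0, 0.0};       // feature vector for one document
        double[] c = explainColumn(beta, x);
        // Feature 0 pushed the decision up by 2.0, feature 1 pulled it
        // down by 1.5, and feature 2 was absent so it contributed nothing.
        for (double v : c) {
            System.out.println(v);
        }
    }
}
```

Sorting the contributions descending gives the most positive terms; negating them first and sorting again gives the most negative ones, which is exactly what the two ModelDissector.update calls in the thread are doing column by column.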

Re: confidence values of one (or more) feature(s)

Posted by David Rahman <dr...@googlemail.com>.
Never done that before, but I will look into it. As an alternative I could
send it to your email. But first I have to implement it successfully.

Thanks again and regards,
David

2011/11/3 Ted Dunning <te...@gmail.com>

> If you do get to that, could you write up a JIRA and attach a patch?
>
> On Thu, Nov 3, 2011 at 1:33 PM, David Rahman <drahman1985@googlemail.com
> >wrote:
>
> > Thank you Ted,
> >
> > I will test the methods next week, when I'm back in the office and let
> you
> > know how it went.
> >
> > Thank you and best regards,
> > David
> >
> > 2011/11/3 Ted Dunning <te...@gmail.com>
> >
> > > OK.
> > >
> > > So the simplest design in Mahout terms is a binary classifier for each
> > > keyword (if the keywords are not mutually exclusive).  If you can
> define
> > a
> > > useful ordering for terms or have some logical entailment, you may want
> > to
> > > allow the presence of some terms to be features for certain other
> terms.
> > >
> > > So the question boils down to how to ask a binary logistic regression
> how
> > > it came to its conclusion.
> > >
> > > You are correct to look to the model dissector for the function you
> want,
> > > but you will have to call it in a slightly unusual way because it is
> > > really intended to describe a model rather than a single decision.  The
> > > logistic regression functions in Mahout don't actually expose quite as
> > much
> > > information as you need for this, but if you add this method, you
> should
> > > get the basic information you need:
> > >
> > >        /**
> > >   * Return the element-wise product of the feature vector versus each
> > > column
> > >   * of the beta matrix.  This can then be used to extract the most
> > > interesting
> > >   * features for a decision for each alternative output.
> > >   * @param instance  A feature vector
> > >   * @return   A matrix like beta but with each column multiplied by
> > > instance.
> > >   */
> > >  public Matrix explain(Vector instance) {
> > >    regularize(instance);
> > >    Matrix r = beta.like().assign(beta);
> > >    for (int column = 0; column < r.columnSize(); column++) {
> > >      r.viewColumn(column).assign(instance, Functions.MULT);
> > >    }
> > >    return r;
> > >  }
> > >
> > >
> > > Then to explain your binary model, you probably want some code like
> this:
> > >
> > >   Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
> > >   Vector instance = encode(data, traceDictionary);
> > >   Matrix b = model.explain(instance);
> > >
> > >   ModelDissector md = new ModelDissector();
> > >   // get positive terms
> > >   md.update(b.viewColumn(0), traceDictionary, model);
> > >   // scan through the top terms
> > >   ...
> > >
> > >   md = new ModelDissector();
> > >   md.update(b.viewColumn(0).assign(Functions.NEGATE), traceDictionary,
> > > model);
> > >   // scan through the most negative terms
> > >   ...
> > >
> > > Note that all of this code is untested and I could be out to lunch
> here.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 3, 2011 at 12:19 PM, David Rahman <
> > drahman1985@googlemail.com
> > > >wrote:
> > >
> > > > Hi Ted,
> > > >
> > > > I want to have the model explain why it classified documents in a
> > certain
> > > > way. That should be enough at first.
> > > >
> > > > I want to classify documents, each document has a corresponding set
> of
> > > > keywords. The model should be able to classify unknown documents and
> > > > provide a number of suggestions of keywords. Later on it should be
> > > possible
> > > > to build a search term recommender for a search engine with
> classified
> > > > documents as a basis.
> > > >
> > > > At first we wanted to use the lucene data, but the existing data is
> > built
> > > > with an older lucene version, so the data is provided in xml, for
> now.
> > > It's
> > > > like the wikipedia example, only with more possible keywords.
> > > >
> > > > Hope it's understandable.
> > > >
> > > > Thanks for your endurance and regards,
> > > > David
> > > >
> > > > 2011/11/3 Ted Dunning <te...@gmail.com>
> > > >
> > > > > I am sorry for being dense, but I don't really understand what you
> > are
> > > > > trying to do.
> > > > >
> > > > > As I see it,
> > > > >
> > > > > - the input is documents
> > > > >
> > > > > - the output is a category
> > > > >
> > > > > You want one or more of the following,
> > > > >
> > > > > - to have the model explain why it classified documents a certain
> way
> > > > >
> > > > > or
> > > > >
> > > > > - to classify non-document phrases a certain way
> > > > >
> > > > > or
> > > > >
> > > > > - to have the model show its internal structure to you
> > > > >
> > > > > or
> > > > >
> > > > > - something else entirely
> > > > >
> > > > > Can you say what you want in these terms?
> > > > >
> > > > > On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <
> > > drahman1985@googlemail.com
> > > > > >wrote:
> > > > >
> > > > > > Hi Ted,
> > > > > >
> > > > > > thank you for the explanation.
> > > > > > For example imagine a term cloud, in which terms are presented.
> > Some
> > > > > terms
> > > > > > are bigger than the others, because they are more likely than the
> other
> > > > > terms. I
> > > > > > would need those results for analysis. We want to compare
> different
> > > > > > ML-algorithms and methods and/or combinations of them. And first
> I
> > > have
> > > > > to
> > > > > > gain some basic knowledge about Mahout.
> > > > > >
> > > > > > For example, when I take the word 'social' as input I'd like to
> > have
> > > > that
> > > > > > result:
> > > > > >
> > > > > > social                    1.0
> > > > > > social media           0.8
> > > > > > social networking    0.65
> > > > > > social news            0.6
> > > > > > facebook                0.5
> > > > > > ...
> > > > > >
> > > > > > (ignore those values, they're not correct, but it should show what I
> > > need)
> > > > > >
> > > > > > The 20Newsgroup-example shows with the summary(int n) method the
> > most
> > > > > > likely categorisation of a term (--> the most important
> feature). I
> > > > would
> > > > > > like to have a list with the second, third, and so on important
> > > > feature.
> > > > > I
> > > > > > imagine, while computing the features, only the most important ones
> > are
> > > > > added
> > > > > > to the list and the less important features are rejected.
> > > > > >
> > > > > > Thanks and regards,
> > > > > > David
> > > > > >
> > > > > > 2011/11/3 Ted Dunning <te...@gmail.com>
> > > > > >
> > > > > > > There are no confidence values per se in the models computed by
> > > > Mahout
> > > > > at
> > > > > > > this time.
> > > > > > >
> > > > > > > There are several issues here,
> > > > > > >
> > > > > > > 1) Naive Bayes doesn't have such a concept.  'Nuff said there.
> > > > > > >
> > > > > > > 2) SGD logistic regression could compute confidence intervals,
> > > but I
> > > > > am
> > > > > > > not quite sure how to do that with stochastic gradient descent.
> > > > > > >
> > > > > > > 3) in most uses of Mahout's logistic regression, the issues are
> > > data
> > > > > size
> > > > > > > and feature set size.  Confidence values are typically used for
> > > > > selecting
> > > > > > > features which is typically not a viable strategy for problems
> > with
> > > > > very
> > > > > > > large feature sets.  That is what the L1 regularization is all
> > > about.
> > > > > > >
> > > > > > > 4) with an extremely large number of features, the noise on
> > confidence
> > > > > > > intervals makes them very hard to understand
> > > > > > >
> > > > > > > 5) with hashed features and feature collisions it is hard
> enough
> > to
> > > > > > > understand which feature is doing what, much less what the
> > > confidence
> > > > > > > interval means.
> > > > > > >
> > > > > > > Can you say more about your problem?  Is it small enough to use
> > > > > bayesglm
> > > > > > in
> > > > > > > R?
> > > > > > >
> > > > > > > On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <
> > > > > drahman1985@googlemail.com
> > > > > > > >wrote:
> > > > > > >
> > > > > > > > Me again,
> > > > > > > >
> > > > > > > > can someone point me in the right direction? How can I access
> these
> > > > > > features?
> > > > > > > > I looked into the summary(int n) method located in
> > > > > > > > org.apache.mahout.classifier.sgd.ModelDissector.java, but
> > > somehow I
> > > > > > don't
> > > > > > > > understand how it works.
> > > > > > > >
> > > > > > > > Could someone explain to me how it works? As I understand it,
> > it
> > > > > > returns
> > > > > > > > just the max-value of a feature.
> > > > > > > >
> > > > > > > > Thanks and regards,
> > > > > > > > David
> > > > > > > >
> > > > > > > > 2011/10/20 David Rahman <dr...@googlemail.com>
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > how can I access the confidence values of one (or more)
> > > > feature(s)
> > > > > > with
> > > > > > > > > its possibilities?
> > > > > > > > >
> > > > > > > > > In the 20Newsgroup-example, there is the dissect method,
> > within
> > > > > there
> > > > > > > is
> > > > > > > > > used summary(int n), which returns the n most important
> > > features
> > > > > with
> > > > > > > > their
> > > > > > > > > weights. I also want the features which are placed second
> or
> > > > third
> > > > > > (or
> > > > > > > > > more). How can I access those?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > David
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: confidence values of one (or more) feature(s)

Posted by Ted Dunning <te...@gmail.com>.
If you do get to that, could you write up a JIRA and attach a patch?

On Thu, Nov 3, 2011 at 1:33 PM, David Rahman <dr...@googlemail.com>wrote:

> Thank you Ted,
>
> I will test the methods next week, when I'm back in the office and let you
> know how it went.
>
> Thank you and best regards,
> David
>
> 2011/11/3 Ted Dunning <te...@gmail.com>
>
> > OK.
> >
> > So the simplest design in Mahout terms is a binary classifier for each
> > keyword (if the keywords are not mutually exclusive).  If you can define
> a
> > useful ordering for terms or have some logical entailment, you may want
> to
> > allow the presence of some terms to be features for certain other terms.
> >
> > So the question boils down to how to ask a binary logistic regression how
> > it came to its conclusion.
> >
> > You are correct to look to the model dissector for the function you want,
> > but you will have to call it in a slightly unusual way because it is
> > really intended to describe a model rather than a single decision.  The
> > logistic regression functions in Mahout don't actually expose quite as
> much
> > information as you need for this, but if you add this method, you should
> > get the basic information you need:
> >
> >        /**
> >   * Return the element-wise product of the feature vector versus each
> > column
> >   * of the beta matrix.  This can then be used to extract the most
> > interesting
> >   * features for a decision for each alternative output.
> >   * @param instance  A feature vector
> >   * @return   A matrix like beta but with each column multiplied by
> > instance.
> >   */
> >  public Matrix explain(Vector instance) {
> >    regularize(instance);
> >    Matrix r = beta.like().assign(beta);
> >    for (int column = 0; column < r.columnSize(); column++) {
> >      r.viewColumn(column).assign(instance, Functions.MULT);
> >    }
> >    return r;
> >  }
> >
> >
> > Then to explain your binary model, you probably want some code like this:
> >
> >   Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
> >   Vector instance = encode(data, traceDictionary);
> >   Matrix b = model.explain(instance);
> >
> >   ModelDissector md = new ModelDissector();
> >   // get positive terms
> >   md.update(b.viewColumn(0), traceDictionary, model);
> >   // scan through the top terms
> >   ...
> >
> >   md = new ModelDissector();
> >   md.update(b.viewColumn(0).assign(Functions.NEGATE), traceDictionary,
> > model);
> >   // scan through the most negative terms
> >   ...
> >
> > Note that all of this code is untested and I could be out to lunch here.
> >
> >
> >
> >
> > On Thu, Nov 3, 2011 at 12:19 PM, David Rahman <
> drahman1985@googlemail.com
> > >wrote:
> >
> > > Hi Ted,
> > >
> > > I want to have the model explain why it classified documents in a
> certain
> > > way. That should be enough at first.
> > >
> > > I want to classify documents, each document has a corresponding set of
> > > keywords. The model should be able to classify unknown documents and
> > > provide a number of suggestions of keywords. Later on it should be
> > possible
> > > to build a search term recommender for a search engine with classified
> > > documents as a basis.
> > >
> > > At first we wanted to use the lucene data, but the existing data is
> built
> > > with an older lucene version, so the data is provided in xml, for now.
> > It's
> > > like the wikipedia example, only with more possible keywords.
> > >
> > > Hope it's understandable.
> > >
> > > Thanks for your endurance and regards,
> > > David
> > >
> > > 2011/11/3 Ted Dunning <te...@gmail.com>
> > >
> > > > I am sorry for being dense, but I don't really understand what you
> are
> > > > trying to do.
> > > >
> > > > As I see it,
> > > >
> > > > - the input is documents
> > > >
> > > > - the output is a category
> > > >
> > > > You want one or more of the following,
> > > >
> > > > - to have the model explain why it classified documents a certain way
> > > >
> > > > or
> > > >
> > > > - to classify non-document phrases a certain way
> > > >
> > > > or
> > > >
> > > > - to have the model show its internal structure to you
> > > >
> > > > or
> > > >
> > > > - something else entirely
> > > >
> > > > Can you say what you want in these terms?
> > > >
> > > > On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <
> > drahman1985@googlemail.com
> > > > >wrote:
> > > >
> > > > > Hi Ted,
> > > > >
> > > > > thank you for the explanation.
> > > > > For example imagine a term cloud, in which terms are presented.
> Some
> > > > terms
> > > > > are bigger than the others, because they are more likely than the other
> > > > terms. I
> > > > > would need those results for analysis. We want to compare different
> > > > > ML-algorithms and methods and/or combinations of them. And first I
> > have
> > > > to
> > > > > gain some basic knowledge about Mahout.
> > > > >
> > > > > For example, when I take the word 'social' as input I'd like to
> have
> > > that
> > > > > result:
> > > > >
> > > > > social                    1.0
> > > > > social media           0.8
> > > > > social networking    0.65
> > > > > social news            0.6
> > > > > facebook                0.5
> > > > > ...
> > > > >
> > > > > (ignore those values, they're not correct, but it should show what I
> > need)
> > > > >
> > > > > The 20Newsgroup-example shows with the summary(int n) method the
> most
> > > > > likely categorisation of a term (--> the most important feature). I
> > > would
> > > > > like to have a list with the second, third, and so on important
> > > feature.
> > > > I
> > > > > imagine, while computing the features, only the most important ones
> are
> > > > added
> > > > > to the list and the less important features are rejected.
> > > > >
> > > > > Thanks and regards,
> > > > > David
> > > > >
> > > > > 2011/11/3 Ted Dunning <te...@gmail.com>
> > > > >
> > > > > > There are no confidence values per se in the models computed by
> > > Mahout
> > > > at
> > > > > > this time.
> > > > > >
> > > > > > There are several issues here,
> > > > > >
> > > > > > 1) Naive Bayes doesn't have such a concept.  'Nuff said there.
> > > > > >
> > > > > > 2) SGD logistic regression could compute confidence intervals,
> > but I
> > > > am
> > > > > > not quite sure how to do that with stochastic gradient descent.
> > > > > >
> > > > > > 3) in most uses of Mahout's logistic regression, the issues are
> > data
> > > > size
> > > > > > and feature set size.  Confidence values are typically used for
> > > > selecting
> > > > > > features which is typically not a viable strategy for problems
> with
> > > > very
> > > > > > large feature sets.  That is what the L1 regularization is all
> > about.
> > > > > >
> > > > > > 4) with an extremely large number of features, the noise on
> confidence
> > > > > > intervals makes them very hard to understand
> > > > > >
> > > > > > 5) with hashed features and feature collisions it is hard enough
> to
> > > > > > understand which feature is doing what, much less what the
> > confidence
> > > > > > interval means.
> > > > > >
> > > > > > Can you say more about your problem?  Is it small enough to use
> > > > bayesglm
> > > > > in
> > > > > > R?
> > > > > >
> > > > > > On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <
> > > > drahman1985@googlemail.com
> > > > > > >wrote:
> > > > > >
> > > > > > > Me again,
> > > > > > >
> > > > > > > can someone point me in the right direction? How can I access these
> > > > > features?
> > > > > > > I looked into the summary(int n) method located in
> > > > > > > org.apache.mahout.classifier.sgd.ModelDissector.java, but
> > somehow I
> > > > > don't
> > > > > > > understand how it works.
> > > > > > >
> > > > > > > Could someone explain to me how it works? As I understand it,
> it
> > > > > returns
> > > > > > > just the max-value of a feature.
> > > > > > >
> > > > > > > Thanks and regards,
> > > > > > > David
> > > > > > >
> > > > > > > 2011/10/20 David Rahman <dr...@googlemail.com>
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > how can I access the confidence values of one (or more)
> > > feature(s)
> > > > > with
> > > > > > > > its possibilities?
> > > > > > > >
> > > > > > > > In the 20Newsgroup-example, there is the dissect method,
> within
> > > > there
> > > > > > is
> > > > > > > > used summary(int n), which returns the n most important
> > features
> > > > with
> > > > > > > their
> > > > > > > > weights. I also want the features which are placed second or
> > > third
> > > > > (or
> > > > > > > > more). How can I access those?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > David
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
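The one-binary-classifier-per-keyword design described in this thread yields exactly the ranked keyword list asked for at the top: score each keyword's model on the same document vector and sort by score. A hedged plain-Java sketch of the scoring step only (made-up weights; training and feature encoding are omitted):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordRanker {
    // Logistic score of one binary model on a document vector:
    // sigmoid of the dot product of weights and features.
    static double score(double[] weights, double[] doc) {
        double s = 0.0;
        for (int i = 0; i < weights.length; i++) {
            s += weights[i] * doc[i];
        }
        return 1.0 / (1.0 + Math.exp(-s));
    }

    // Rank keywords by their model's score on the document, best first.
    static List<String> rank(Map<String, double[]> models, double[] doc) {
        List<String> keys = new ArrayList<>(models.keySet());
        keys.sort((a, b) -> Double.compare(score(models.get(b), doc),
                                           score(models.get(a), doc)));
        return keys;
    }

    public static void main(String[] args) {
        Map<String, double[]> models = new HashMap<>();
        models.put("social", new double[]{3.0, 0.0});
        models.put("social media", new double[]{2.0, 1.0});
        models.put("weather", new double[]{-2.0, 0.0});
        double[] doc = {1.0, 0.0}; // document containing only feature 0
        // Keywords whose models like this document come out first.
        System.out.println(rank(models, doc));
    }
}
```

With real Mahout models, classifyScalar on each per-keyword OnlineLogisticRegression would play the role of score here; the ranking step is the same.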

Re: confidence values of one (or more) feature(s)

Posted by David Rahman <dr...@googlemail.com>.
Thank you Ted,

I will test the methods next week, when I'm back in the office and let you
know how it went.

Thank you and best regards,
David

2011/11/3 Ted Dunning <te...@gmail.com>

> OK.
>
> So the simplest design in Mahout terms is a binary classifier for each
> keyword (if the keywords are not mutually exclusive).  If you can define a
> useful ordering for terms or have some logical entailment, you may want to
> allow the presence of some terms to be features for certain other terms.
>
> So the question boils down to how to ask a binary logistic regression how
> it came to its conclusion.
>
> You are correct to look to the model dissector for the function you want,
> but you will have to call it in a slightly unusual way because it is
> really intended to describe a model rather than a single decision.  The
> logistic regression functions in Mahout don't actually expose quite as much
> information as you need for this, but if you add this method, you should
> get the basic information you need:
>
>        /**
>   * Return the element-wise product of the feature vector versus each
> column
>   * of the beta matrix.  This can then be used to extract the most
> interesting
>   * features for a decision for each alternative output.
>   * @param instance  A feature vector
>   * @return   A matrix like beta but with each column multiplied by
> instance.
>   */
>  public Matrix explain(Vector instance) {
>    regularize(instance);
>    Matrix r = beta.like().assign(beta);
>    for (int column = 0; column < r.columnSize(); column++) {
>      r.viewColumn(column).assign(instance, Functions.MULT);
>    }
>    return r;
>  }
>
>
> Then to explain your binary model, you probably want some code like this:
>
>   Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
>   Vector instance = encode(data, traceDictionary);
>   Matrix b = model.explain(instance);
>
>   ModelDissector md = new ModelDissector();
>   // get positive terms
>   md.update(b.viewColumn(0), traceDictionary, model);
>   // scan through the top terms
>   ...
>
>   md = new ModelDissector();
>   md.update(b.viewColumn(0).assign(Functions.NEGATE), traceDictionary,
> model);
>   // scan through the most negative terms
>   ...
>
> Note that all of this code is untested and I could be out to lunch here.

Re: confidence values of one (or more) feature(s)

Posted by Ted Dunning <te...@gmail.com>.
OK.

So the simplest design in Mahout terms is a binary classifier for each
keyword (if the keywords are not mutually exclusive).  If you can define a
useful ordering for terms or have some logical entailment, you may want to
allow the presence of some terms to be features for certain other terms.
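To make that concrete, here is a tiny self-contained sketch of the one-binary-model-per-keyword design (plain Java, not the Mahout API; every model, term weight, and threshold below is invented for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy one-vs-rest keyword tagging: one independent binary model per
// keyword, and every keyword whose model says "yes" becomes a
// suggestion.  Each "model" here is just an invented term-weight map.
public class OneVsRestSketch {
  static double score(Map<String, Double> weights, Set<String> docTerms) {
    double s = 0;
    for (String t : docTerms) {
      s += weights.getOrDefault(t, 0.0);
    }
    return 1.0 / (1.0 + Math.exp(-s));  // logistic link -> probability-like score
  }

  public static void main(String[] args) {
    Map<String, Map<String, Double>> models = new HashMap<>();
    models.put("social media", Map.of("social", 2.0, "facebook", 1.5));
    models.put("sports", Map.of("football", 2.5, "social", -0.5));

    Set<String> doc = Set.of("social", "facebook", "news");
    List<String> suggestions = new ArrayList<>();
    for (Map.Entry<String, Map<String, Double>> e : models.entrySet()) {
      // keywords are not mutually exclusive, so each model fires independently
      if (score(e.getValue(), doc) > 0.5) {
        suggestions.add(e.getKey());
      }
    }
    System.out.println(suggestions);  // "social media" fires, "sports" does not
  }
}
```

Because the models are independent, a document can get zero, one, or many keywords, which is what you want when the labels are not mutually exclusive.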

So the question boils down to how to ask a binary logistic regression how
it came to its conclusion.

You are correct to look to the model dissector for the function you want,
but you will have to call it in a slightly unusual way because it is
really intended to describe a model rather than a single decision.  The
logistic regression functions in Mahout don't actually expose quite as much
information as you need for this, but if you add this method, you should
get the basic information you need:

  /**
   * Return the element-wise product of the feature vector versus each column
   * of the beta matrix.  This can then be used to extract the most
   * interesting features for a decision for each alternative output.
   * @param instance  A feature vector
   * @return   A matrix like beta but with each column multiplied by instance.
   */
  public Matrix explain(Vector instance) {
    regularize(instance);
    Matrix r = beta.like().assign(beta);
    for (int column = 0; column < r.columnSize(); column++) {
      r.viewColumn(column).assign(instance, Functions.MULT);
    }
    return r;
  }


Then to explain your binary model, you probably want some code like this:

   Map<String, Set<Integer>> traceDictionary = Maps.newHashMap();
   Vector instance = encode(data, traceDictionary);
   Matrix b = model.explain(instance);

   ModelDissector md = new ModelDissector();
   // get positive terms
   md.update(b.viewColumn(0), traceDictionary, model);
   // scan through the top terms
   ...

   md = new ModelDissector();
   md.update(b.viewColumn(0).assign(Functions.NEGATE), traceDictionary, model);
   // scan through the most negative terms
   ...

Note that all of this code is untested and I could be out to lunch here.
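To show the idea behind explain() without any Mahout classes, here is a self-contained sketch (all feature names, weights, and values invented) of how the per-feature products beta[i] * x[i] rank the features behind one decision:

```java
import java.util.Arrays;

// Toy illustration (not the Mahout API): a logistic regression decision
// is driven by the raw score beta . x, so the per-feature products
// beta[i] * x[i] tell you which features pushed the decision and how hard.
public class ExplainSketch {
  public static void main(String[] args) {
    String[] features = {"social", "media", "facebook", "sports"};
    double[] beta     = {2.0, 1.5, 0.5, -1.0};  // learned weights (invented)
    double[] x        = {1.0, 1.0, 0.0, 1.0};   // feature vector of one document

    // per-feature contributions and the total score
    final double[] contrib = new double[beta.length];
    double score = 0;
    for (int i = 0; i < beta.length; i++) {
      contrib[i] = beta[i] * x[i];
      score += contrib[i];
    }

    // rank features by contribution, most positive first
    Integer[] order = {0, 1, 2, 3};
    Arrays.sort(order, (a, b) -> Double.compare(contrib[b], contrib[a]));

    System.out.printf("score = %.1f%n", score);  // 2.0 + 1.5 + 0.0 - 1.0 = 2.5
    for (int i : order) {
      System.out.printf("%-10s %5.1f%n", features[i], contrib[i]);
    }
  }
}
```

Negating the contributions and re-sorting gives the most negative features, which is what the second ModelDissector pass above is after.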




On Thu, Nov 3, 2011 at 12:19 PM, David Rahman <dr...@googlemail.com>wrote:

> Hi Ted,
>
> I want to have the model explain why it classified documents in a certain
> way. That should be enough at first.
>
> I want to classify documents; each document has a corresponding set of
> keywords. The model should be able to classify unknown documents and
> provide a number of keyword suggestions. Later on it should be possible
> to build a search term recommender for a search engine with classified
> documents as a basis.
>
> At first we wanted to use the Lucene data, but the existing data is built
> with an older Lucene version, so the data is provided in XML, for now. It's
> like the Wikipedia example, only with more possible keywords.
>
> Hope it's understandable.
>
> Thanks for your endurance and regards,
> David

Re: confidence values of one (or more) feature(s)

Posted by David Rahman <dr...@googlemail.com>.
Hi Ted,

I want to have the model explain why it classified documents in a certain
way. That should be enough at first.

I want to classify documents; each document has a corresponding set of
keywords. The model should be able to classify unknown documents and
provide a number of keyword suggestions. Later on it should be possible
to build a search term recommender for a search engine with classified
documents as a basis.

At first we wanted to use the Lucene data, but the existing data is built
with an older Lucene version, so the data is provided in XML, for now. It's
like the Wikipedia example, only with more possible keywords.

Hope it's understandable.

Thanks for your endurance and regards,
David

2011/11/3 Ted Dunning <te...@gmail.com>

> I am sorry for being dense, but I don't really understand what you are
> trying to do.
>
> As I see it,
>
> - the input is documents
>
> - the output is a category
>
> You want one or more of the following,
>
> - to have the model explain why it classified documents a certain way
>
> or
>
> - to classify non-document phrases a certain way
>
> or
>
> - to have the model show its internal structure to you
>
> or
>
> - something else entirely
>
> Can you say what you want in these terms?

Re: confidence values of one (or more) feature(s)

Posted by Ted Dunning <te...@gmail.com>.
I am sorry for being dense, but I don't really understand what you are
trying to do.

As I see it,

- the input is documents

- the output is a category

You want one or more of the following,

- to have the model explain why it classified documents a certain way

or

- to classify non-document phrases a certain way

or

- to have the model show its internal structure to you

or

- something else entirely

Can you say what you want in these terms?

On Thu, Nov 3, 2011 at 8:43 AM, David Rahman <dr...@googlemail.com>wrote:

> Hi Ted,
>
> thank you for the explanation.
> For example, imagine a term cloud in which terms are presented. Some terms
> are bigger than others, because they are more likely than the other terms. I
> would need those results for analysis. We want to compare different
> ML-algorithms and methods and/or combinations of them. And first I have to
> gain some basic knowledge about Mahout.
>
> For example, when I take the word 'social' as input I'd like to have that
> result:
>
> social                    1.0
> social media           0.8
> social networking    0.65
> social news            0.6
> facebook                0.5
> ...
>
> (ignore those values; they're not correct, but they should show what I need)
>
> The 20Newsgroups example uses the summary(int n) method to show the most
> likely categorisation of a term (--> the most important feature). I would
> also like a list of the second, third, and so on most important features. I
> imagine that, while computing the features, only the most important ones are
> added to the list and the less important features are rejected.
>
> Thanks and regards,
> David

Re: confidence values of one (or more) feature(s)

Posted by David Rahman <dr...@googlemail.com>.
Hi Ted,

thank you for the explanation.
For example, imagine a term cloud in which terms are presented. Some terms
are bigger than others, because they are more likely than the other terms. I
would need those results for analysis. We want to compare different
ML-algorithms and methods and/or combinations of them. And first I have to
gain some basic knowledge about Mahout.

For example, when I take the word 'social' as input I'd like to have that
result:

social                    1.0
social media           0.8
social networking    0.65
social news            0.6
facebook                0.5
...

(ignore those values; they're not correct, but they should show what I need)

The 20Newsgroups example uses the summary(int n) method to show the most
likely categorisation of a term (--> the most important feature). I would
also like a list of the second, third, and so on most important features. I
imagine that, while computing the features, only the most important ones are
added to the list and the less important features are rejected.

Thanks and regards,
David

2011/11/3 Ted Dunning <te...@gmail.com>

> There are no confidence values per se in the models computed by Mahout at
> this time.
>
> There are several issues here,
>
> 1) Naive Bayes doesn't have such a concept.  'Nuff said there.
>
> 2) SGD logistic regression could compute confidence intervals, but I am
> not quite sure how to do that with stochastic gradient descent.
>
> 3) in most uses of Mahout's logistic regression, the issues are data size
> and feature set size.  Confidence values are typically used for selecting
> features which is typically not a viable strategy for problems with very
> large feature sets.  That is what the L1 regularization is all about.
>
> 4) with an extremely large number of features, the noise on confidence
> intervals makes them very hard to understand
>
> 5) with hashed features and feature collisions it is hard enough to
> understand which feature is doing what, much less what the confidence
> interval means.
>
> Can you say more about your problem?  Is it small enough to use bayesglm in
> R?

Re: confidence values of one (or more) feature(s)

Posted by Ted Dunning <te...@gmail.com>.
There are no confidence values per se in the models computed by Mahout at
this time.

There are several issues here,

1) Naive Bayes doesn't have such a concept.  'Nuff said there.

2) SGD logistic regression could compute confidence intervals, but I am
not quite sure how to do that with stochastic gradient descent.

3) in most uses of Mahout's logistic regression, the issues are data size
and feature set size.  Confidence values are typically used for selecting
features which is typically not a viable strategy for problems with very
large feature sets.  That is what the L1 regularization is all about.

4) with an extremely large number of features, the noise on confidence
intervals makes them very hard to understand

5) with hashed features and feature collisions it is hard enough to
understand which feature is doing what, much less what the confidence
interval means.

Can you say more about your problem?  Is it small enough to use bayesglm in
R?
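As an illustration of point 3, here is a toy sketch (plain Java with invented weights and lambda; not how Mahout implements its lazy regularization) of the L1 soft-threshold step that drives weak weights to exactly zero, which is what makes L1 act as feature selection:

```java
import java.util.Arrays;

// Toy L1 "soft threshold": each regularization step moves every weight
// toward zero by lambda and clips anything that crosses zero.  Weak
// features end up at exactly 0.0, i.e. they are deselected.
public class L1Sketch {
  static double softThreshold(double w, double lambda) {
    if (w > lambda) {
      return w - lambda;
    } else if (w < -lambda) {
      return w + lambda;
    } else {
      return 0.0;  // small weights are zeroed out entirely
    }
  }

  public static void main(String[] args) {
    double[] weights = {2.3, -0.04, 0.9, 0.02, -1.7};  // invented weights
    double lambda = 0.1;
    for (int i = 0; i < weights.length; i++) {
      weights[i] = softThreshold(weights[i], lambda);
    }
    // strong weights survive (slightly shrunk); weak ones become 0.0
    System.out.println(Arrays.toString(weights));
  }
}
```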

On Thu, Nov 3, 2011 at 7:25 AM, David Rahman <dr...@googlemail.com>wrote:

> Me again,
>
> can someone point me in the right direction? How can I access these features?
> I looked into the summary(int n) method located in
> org.apache.mahout.classifier.sgd.ModelDissector.java, but somehow I don't
> understand how it works.
>
> Could someone explain to me how it works? As I understand it, it returns
> just the max-value of a feature.
>
> Thanks and regards,
> David