Posted to user@mahout.apache.org by Fernando Santos <fe...@gmail.com> on 2013/12/06 19:14:40 UTC

SVM Implementation for mahout?

Hello,

Is there any tested SVM implementation for Mahout?

Mahout in Action says there is a sequential implementation, but that it is
still "experimental". I couldn't find this implementation.

Thanks

-- 
Fernando Santos
+55 61 8129 8505

Re: SVM Implementation for mahout?

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Dec 8, 2013 at 5:50 PM, Fernando Santos <
fernandoleandro1991@gmail.com> wrote:

> Actually I had never heard of PCA and LDA. I'll take a look at them.
>

PCA and LDA are probably not quite what you want for Naive Bayes,
especially in Mahout.  There is an assumption of a sparse binary
representation for data.

Re: SVM Implementation for mahout?

Posted by Fernando Santos <fe...@gmail.com>.
Hello,

I think my problem is related to the fact that the dataset is really
unbalanced. My 3 classes have 550k, 150k and 70k documents respectively.
Naive Bayes also bases its classification on the prior probability of a
class c over all documents, so this imbalance is probably making a big
difference.
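
To make the effect of the imbalance concrete, here is a minimal sketch that turns the three class sizes quoted above into prior probabilities. The class labels are placeholders (the thread does not name the classes), and this is plain arithmetic rather than anything taken from Mahout's code.

```python
import math

# Class sizes quoted in the message; the labels are hypothetical placeholders.
class_counts = {"class_a": 550_000, "class_b": 150_000, "class_c": 70_000}

total = sum(class_counts.values())

# Prior P(c) = documents in class c / all documents.
priors = {label: count / total for label, count in class_counts.items()}

# Naive Bayes adds log P(c) to each class score, so the largest class
# starts every comparison with a head start.
log_priors = {label: math.log(p) for label, p in priors.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(priors.values())

for label in class_counts:
    print(f"{label}: prior={priors[label]:.3f} log_prior={log_priors[label]:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

The last figure is a useful yardstick: on data this skewed, any reported accuracy is best read against what always guessing the largest class would already achieve.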

Lucas, I'm just using the pre-processing available through seq2sparse,
which means defining a minimum word frequency and a maximum document
frequency percentage (which works as a stoplist). And yes, I'm using the
tf-idf vectors for training and testing.
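
The two pruning rules described above can be sketched in a few lines. This is an illustration only: the function names, the parameter names `min_support` and `max_df_percent`, and the toy documents are assumptions made for the example, and the textbook tf * log(N / df) weight used here is not necessarily the exact formula seq2sparse applies.

```python
import math
from collections import Counter

def prune_vocabulary(docs, min_support=2, max_df_percent=80):
    """Keep terms that are frequent enough overall but not in too many docs."""
    term_freq = Counter(term for doc in docs for term in doc)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    max_docs = len(docs) * max_df_percent / 100
    return {
        term
        for term in term_freq
        if term_freq[term] >= min_support and doc_freq[term] <= max_docs
    }

def tfidf(doc, docs, vocabulary):
    """Textbook tf * log(N / df) weights for one tokenized document."""
    doc_freq = Counter(term for d in docs for term in set(d))
    counts = Counter(term for term in doc if term in vocabulary)
    return {t: c * math.log(len(docs) / doc_freq[t]) for t, c in counts.items()}

# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "svm", "patch", "works"],
    ["the", "naive", "bayes", "model"],
    ["the", "svm", "model", "fails"],
    ["the", "bayes", "patch"],
]

vocab = prune_vocabulary(docs, min_support=2, max_df_percent=80)
weights = tfidf(docs[0], docs, vocab)
print(sorted(vocab))
print(weights)
```

Here "the" is dropped because it appears in every document, which is the stoplist-like behaviour mentioned above, while words that occur only once fall below the minimum frequency.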

Actually I had never heard of PCA and LDA. I'll take a look at them.

Thanks




-- 
Fernando Santos
+55 61 8129 8505

Re: SVM Implementation for mahout?

Posted by Lucas Fernandes Brunialti <lb...@igcorp.com.br>.
Hi,

Fernando, to get a better understanding of correlation, you could think of
features as events in probability: if the probability of the intersection
is much higher than the product of the individual probabilities, the events
are highly correlated...
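
As a rough illustration of this view, the sketch below treats "word w appears in a document" as an event and compares the joint probability of two words with the product of their individual probabilities. The eight toy documents and the helper names `prob` and `lift` are made up for the example.

```python
# Hypothetical toy corpus: each document is the set of words it contains.
docs = [
    {"new", "york", "city"},
    {"new", "york", "times"},
    {"new", "york", "market"},
    {"new", "phone"},
    {"stock", "crash"},
    {"phone", "crash"},
    {"stock", "market"},
    {"city", "times"},
]

def prob(*words):
    """Fraction of documents that contain all of the given words."""
    return sum(all(w in d for w in words) for d in docs) / len(docs)

def lift(a, b):
    """P(a and b) / (P(a) * P(b)); a value of 1.0 is what independence gives."""
    return prob(a, b) / (prob(a) * prob(b))

print("new/york  :", lift("new", "york"))    # well above 1: strongly associated
print("new/market:", lift("new", "market"))  # about 1: looks independent
```

In real text, pairs such as "new" and "york" co-occur far more often than independence would predict, which is the kind of correlated feature the naive Bayes assumption ignores.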

I agree with Ted. But usually, naive Bayes works well for text
classification when you have a good pre-processing phase, using PCA, tf-idf
or LDA... Are you doing any pre-processing?

Re: SVM Implementation for mahout?

Posted by Ted Dunning <te...@gmail.com>.
The problem of correlation of features is clearly present in text, but it is not so clear what the effect will be. For naive Bayes this has the effect of making the classifier overconfident, but it usually still works reasonably well. For logistic regression without regularization it can cause the learning algorithm to fail (Mahout's logistic regression is regularized, btw).
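
The overconfidence effect can be seen in a small two-class sketch. It assumes the extreme case of correlation, where the same word is simply counted several times as if each copy were independent evidence; the likelihoods 0.8 and 0.4 and the equal priors are made-up numbers for illustration.

```python
def posterior_class_1(copies, p_w_c1=0.8, p_w_c2=0.4, prior_c1=0.5):
    """Naive Bayes posterior for class 1 when one word is counted `copies` times."""
    score_1 = prior_c1 * p_w_c1 ** copies
    score_2 = (1 - prior_c1) * p_w_c2 ** copies
    return score_1 / (score_1 + score_2)

# The evidence is the same single word each time; only the count changes.
for copies in (1, 2, 3):
    print(copies, round(posterior_class_1(copies), 3))
```

The predicted class never changes, which fits the observation that the classifier still works reasonably well, but the reported probability drifts toward 1 as redundant features pile up.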

Empirical evidence dominates theory in this situation. 

Sent from my iPhone

> On Dec 8, 2013, at 9:14, Fernando Santos <fe...@gmail.com> wrote:
> 
> Now, just a theoretical question. In a text classification example, what
> would it mean to have features that are highly correlated? I mean, in this
> case our features are basically words; do you have an example of how these
> features can fail to be independent? This concept is not really clear in my
> mind...

Re: SVM Implementation for mahout?

Posted by Fernando Santos <fe...@gmail.com>.
Hello Lucas,

Thanks for the advice. It seems that this patch is a working implementation
of MLP (https://issues.apache.org/jira/browse/MAHOUT-1265). I'll give it a
try.

Have you ever used it? If so, any advice?

Now, just a theoretical question. In a text classification example, what
would it mean to have features that are highly correlated? I mean, in this
case our features are basically words; do you have an example of how these
features can fail to be independent? This concept is not really clear in my
mind...

Thanks





-- 
Fernando Santos
+55 61 8129 8505

Re: SVM Implementation for mahout?

Posted by Lucas Fernandes Brunialti <lb...@igcorp.com.br>.
Hello Fernando,

The naive Bayes approach assumes that your features are independent; if
your features are highly correlated, naive Bayes won't be a good choice.

I would advise you to try a neural network (MLP); it can learn a better
decision surface than logistic regression...
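
A classic way to see the difference in decision surfaces is XOR. The sketch below is only an illustration with hand-set weights: it does not train anything and does not use the Mahout MLP patch. It shows that no single linear threshold unit on a coarse weight grid reproduces XOR, while a network with two hidden units does.

```python
def step(x):
    """Hard threshold activation."""
    return 1 if x > 0 else 0

def linear(x1, x2, w1, w2, b):
    """Single linear threshold unit: the kind of boundary logistic regression draws."""
    return step(w1 * x1 + w2 * x2 + b)

def mlp(x1, x2):
    """Two hidden units (OR and AND) feeding one output unit: OR and not AND."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Try every weight combination on a coarse grid from -3.0 to 3.0.
grid = [i / 2 for i in range(-6, 7)]
linear_fits = any(
    all(linear(x1, x2, w1, w2, b) == y for (x1, x2), y in xor_data)
    for w1 in grid
    for w2 in grid
    for b in grid
)
mlp_fits = all(mlp(x1, x2) == y for (x1, x2), y in xor_data)

print("some linear unit fits XOR:", linear_fits)
print("hand-wired MLP fits XOR:  ", mlp_fits)
```

Whether that extra flexibility helps on sparse text features is an empirical question; text problems are often close to linearly separable.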

Best.

Lucas.

Re: SVM Implementation for mahout?

Posted by Fernando Santos <fe...@gmail.com>.
Hello Suneel,

I want to check whether SVM gives any better performance.

I've been using naive Bayes, but my data is quite unbalanced and therefore
I'm getting pretty bad results with it. I also tried complementary
naive Bayes, but got the same bad results. I read about a difference in
performance between the Weka and Mahout NaiveBayes implementations, and
maybe that's the cause (
http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/%3CCABdaxxiJTfV9nhQXxPYd72RRsv-H60Ps13H0PUNd2iNJX70BnA@mail.gmail.com%3E
).

I also tried logistic regression and got around 77% accuracy. So maybe
SVM could do better.





-- 
Fernando Santos
+55 61 8129 8505

Re: SVM Implementation for mahout?

Posted by Suneel Marthi <su...@yahoo.com>.
Any specific reason you are looking for an SVM implementation only?
Are you sure that those patches are still relevant given the codebase today?






Re: SVM Implementation for mahout?

Posted by Fernando Santos <fe...@gmail.com>.
Thanks Manuel.

It seems that these two patches
(https://issues.apache.org/jira/browse/MAHOUT-334 and
https://issues.apache.org/jira/browse/MAHOUT-232) might work, although not
in parallel.

Has anyone successfully used either of these two patches and could share
some comments about them?

Thanks




-- 
Fernando Santos
+55 61 8129 8505

Re: SVM Implementation for mahout?

Posted by Manuel Blechschmidt <Ma...@gmx.de>.
Hi Fernando,
there are some patches and some discussions:

SVM:
https://issues.apache.org/jira/browse/MAHOUT-334
https://issues.apache.org/jira/browse/MAHOUT-232
https://issues.apache.org/jira/browse/MAHOUT-14
https://issues.apache.org/jira/browse/MAHOUT-227

/Manuel


-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B