Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2008/03/24 16:07:55 UTC

Regarding Google Summer of Code Lucene Mahout Project

Hi Admins,

I went through the Google Summer of Code wiki and found out about the
mahout-machine-learning project. I wish to participate in implementing the
papers. I am currently working on my B.Tech thesis, which is to extract
opinionated sentences from blogs. This is also part of the Text Retrieval
Conference (TREC) 2008 Blog Track
<http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG/#head-9dd52f8791e8d7ba62f3bdd63932e0ec04e83ac8>,
and I am working under the guidance of Prof. Sudeshna Sarkar
<http://www.facweb.iitkgp.ernet.in/%7Esudeshna>. While implementing my TREC
system, I have experimented with classifiers (NB, SVM, decision trees) and
clustering algorithms (k-means and Gaussian mixtures). For the project I
used the C# version of Lucene (Lucene.NET) to index and retrieve documents
in the Blog06 collection
<http://ir.dcs.gla.ac.uk/test_collections/blog06info.html> (160GB). I
believe working on this project would help me further improve the
performance and efficiency of the system I am working on, and would ease me
into working with the open source community.

I am a 4th year CS student at IIT Kharagpur working towards a Dual Degree
(B.Tech + M.Tech), and this would be my first time working with an
open-source project. Could you suggest the things I should get comfortable
with for implementing this, as well as the level of detail you require in
the implementation proposal?

Robin

Re: Regarding Google Summer of Code Lucene Mahout Project

Posted by Grant Ingersoll <gs...@apache.org>.

I'd have a look at the wiki and the NIPS paper listed there, and also  
search the archives for GSOC discussions.  I'd also start looking into  
Hadoop and the existing code we have.  Then, just go ahead and make a  
proposal.  I'm particularly interested in classifiers, but I know  
there is a good deal of interest in clustering too (we already have a  
k-means impl).  For classifiers, I am slowly, but surely, working on a  
naive bayes implementation (time is always a question for me), thus,  
implementing decision trees or SVM would be really cool.

Cheers,
Grant

Re: Regarding Google Summer of Code Lucene Mahout Project

Posted by Robin Anil <ro...@gmail.com>.

Hi,

I have submitted my proposal for implementing Complementary Naive Bayes.
Along with it, I have proposed to implement an EM algorithm to help learn
from unlabelled data. More explanation of the method can be found in this
paper: http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/conll03.pdf
My deliverables include both algorithms and 2 demos using the 20Newsgroups
dataset.

PS: This is not an April Fools Proposal :)

Happy Fools Day Everyone.
Robin

Re: Regarding Google Summer of Code Lucene Mahout Project

Posted by Jason Rennie <jr...@gmail.com>.

On Tue, Mar 25, 2008 at 3:49 PM, Isabel Drost <ap...@isabel-drost.de>
wrote:

> The paper looks interesting: The modifications to naive bayes presented in
> the paper seem to lead to a classifier that is comparable to SVM
> performance for text classification while having far better performance.

The text transforms aren't particularly interesting---they're essentially
just a TFIDF pre-processing step.  This botches the multinomial model a bit,
but works well and has some theoretical motivation (in sec. 4).
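
A minimal sketch of the kind of TFIDF-style pre-processing described above, assuming the usual three steps: a log term-frequency transform, an inverse document frequency weight, and L2 length normalisation. The function name `transform` and the dict-based document representation are hypothetical choices for illustration, not taken from the paper or from any Mahout code.

```python
import math

def transform(counts):
    """counts: list of dicts mapping term -> raw count, one dict per document.
    Returns one dict per document with transformed, length-normalised weights."""
    n_docs = len(counts)
    # Document frequency: number of documents that contain each term.
    df = {}
    for doc in counts:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    out = []
    for doc in counts:
        # log(1 + count) damps repeated terms; log(N / df) downweights common terms.
        weighted = {
            t: math.log(1 + c) * math.log(n_docs / df[t]) for t, c in doc.items()
        }
        # Length normalisation: divide by the L2 norm of the document vector.
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        out.append({t: (w / norm if norm > 0 else 0.0) for t, w in weighted.items()})
    return out
```

Note that a term occurring in every document gets weight zero under this scheme, so the transformed values are no longer counts. That is one way to read the remark that the transforms "botch" the multinomial model.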

I thought the discovery of the "skewed data bias" was quite interesting.
Was never completely satisfied with our solution, Complement Naive Bayes.
But, it did seem to solve the problem and yield improved performance for
data where the number of examples per class varied widely.

> Seems like you need an algorithm that outputs comparable scores for each
> document and is neither under- nor overconfident. I remember vaguely that
> the vanilla NB had some problems in this respect.

Complement NB gets rid of some of this problem, though Logistic Regression
or Softmax (the multiclass variant) is probably a generally better
solution.  'course LR/Softmax requires optimization whereas (C)NB requires
little more than counting and basic math ops...  easier to implement...
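
To illustrate the contrast drawn above, here is a rough sketch of binary logistic regression trained with plain batch gradient descent. Unlike (C)NB, the weights cannot be read off from counts and have to be found iteratively. The names `train_logistic` and `predict_proba`, the learning rate and the epoch count are all assumptions made for this example.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=200):
    """xs: list of feature lists; ys: list of 0/1 labels.
    Returns (weights, bias) fitted by batch gradient descent on the log loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Gradient of the log loss for one example is (p - y) * x.
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += p - y
        # Step against the averaged gradient.
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def predict_proba(w, b, x):
    # Probability of class 1 for feature list x.
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

The training loop is the "optimization" part: every epoch touches all examples, whereas the NB variants need a single counting pass.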

Jason

-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

Re: Regarding Google Summer of Code Lucene Mahout Project

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Tuesday 25 March 2008, Robin Anil wrote:
> You may be interested in reading the paper which talks more about it Here
> <http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf>.

The paper looks interesting: The modifications to naive bayes presented in the 
paper seem to lead to a classifier that is comparable to SVM performance for 
text classification while having far better performance.

> the feature selection module is overloaded for each of them.

Sounds reasonable to me. I would guess the feature selection module is 
independent of the classifier?

> > So what you are hoping for is a system that can crawl and answer queries
> > at the same time, integrating more and more information as it becomes
> > available, right?
>
> No because the queries arent fixed. If you disregard the TREC queries, say
> a person is sitting there asking for opinion about a target. He may type
> "Nokia 6600" or "My left hand". Now, I would have to go though the DB and
> find everything which talks about Nokia and the other and do post
> processing if its not yet processed.

I see - you want to do the sentiment classification step at query time and 
therefore you need it to be efficient. This implies that you need to store 
each text unit (say each blog posting) either in clear text or as some 
general feature vector (depending on whether your features are query
dependent or not) and do the classification at query time.

> Another reason is the ranking of the results become a problem. How do i say
> which among the 1000 results gives the better opinion. The doc that talks
> more about the target or the one which has more opinions about the target.
> Neither, we need to rank them based on the output of Classification
> Algorithms. 

Seems like you need an algorithm that outputs comparable scores for each 
document and is neither under- nor overconfident. I remember vaguely that the 
vanilla NB had some problems in this respect.

Isabel

-- 
The most important design issue... is the fact that Linux is supposed to be 
fun...		-- Linus Torvalds at the First Dutch International Symposium on Linux
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Regarding Google Summer of Code Lucene Mahout Project

Posted by Matthew Riley <mr...@gmail.com>.

Maybe you can get some feedback from Jason Rennie (one of the authors of the
paper you linked) on your implementation - I seem to remember seeing some
comments from him on this mailing list about a week ago.

Matt

On Tue, Mar 25, 2008 at 8:55 AM, Robin Anil <ro...@gmail.com> wrote:

> You may be interested in reading the paper which talks more about it Here
> <http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf>.

Re: Regarding Google Summer of Code Lucene Mahout Project

Posted by Robin Anil <ro...@gmail.com>.

Hi Isabel,

On Tue, Mar 25, 2008 at 2:52 AM, Isabel Drost <ap...@isabel-drost.de>
wrote:
>
> On Monday 24 March 2008, Robin Anil wrote:
>
> > The Complement-Naive-Bayes-Classifier(coded up for this project) then
> > run on the retrieved document to do post processing.
>
> The ideas presented in the slides look pretty interesting to me. Could you
> please provide some pointers to information in the Complement Naive Bayes
> Classifier? What were the reasons you chose this classifier?
>
Before going into Complement Naive Bayes, there are certain things to note
about text classification. Given a good amount of data, as is the case with
textual data, Naive Bayes surprisingly performs better than most other
supervised learners. The reason, as I see it, is that Naive Bayes class
margins are so bluntly defined that the chance of overfitting is low. This
is also the reason why, given the proper features, Naive Bayes doesn't
measure up to other methods. So you may say Naive Bayes is a good classifier
for textual data. Now Complement Naive Bayes does the reverse: instead of
calculating which class fits the document best, it calculates which
complement class fits the document least. It also removes the bias problem
due to the prior probability term in the NB equation. You may be interested
in reading the paper which talks more about it here:
<http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf>. My
BaseClassifier implementation reproduces the work there. The different
classifiers (SpamDetection, Subjectivity, Polarity) all inherit from the
base classifier, but the feature selection module is overloaded for each of
them.

As you can see, all of them except Polarity (classes are Pos, Neg, Neutral)
are binary classifiers, where CNB is exactly the same as NB (just a -ve sign
difference). But other things like normalization made a lot of difference in
removing the false positives and biased classes.
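
The complement idea described above can be sketched roughly as follows. This is a simplified illustration, not the BaseClassifier mentioned in this mail: the names `train_cnb` and `classify_cnb`, the dict-based documents and the smoothing value `alpha` are assumptions. For each class, term counts are gathered from all documents outside that class, turned into smoothed log weights, normalised, and the predicted class is the one whose complement fits the document least.

```python
import math
from collections import defaultdict

def train_cnb(docs, labels, alpha=1.0):
    """docs: list of dicts term -> count; labels: list of class names.
    Returns {class: {term: weight}} built from complement-class counts."""
    vocab = {t for d in docs for t in d}
    weights = {}
    for c in sorted(set(labels)):
        # Count terms over every document that is NOT in class c.
        comp = defaultdict(float)
        for d, y in zip(docs, labels):
            if y != c:
                for t, n in d.items():
                    comp[t] += n
        total = sum(comp.values()) + alpha * len(vocab)
        w = {t: math.log((comp[t] + alpha) / total) for t in vocab}
        # Weight normalisation, so no class dominates through larger weights.
        z = sum(abs(v) for v in w.values())
        weights[c] = {t: v / z for t, v in w.items()}
    return weights

def classify_cnb(weights, doc):
    # Pick the class whose complement fits the document least.
    def complement_fit(c):
        return sum(n * weights[c].get(t, 0.0) for t, n in doc.items())
    return min(weights, key=complement_fit)
```

No prior term appears in `classify_cnb`, which reflects the point about the prior probability bias. With only two classes, the complement of one class is simply the other class.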

>
>
> > If its possible to have the classifier run along with Lucene and
> > spit out sentences and add them to a field in real-time, It would
> > essentially enable this system to be online and allow for real-time
> > queries.
>
> So what you are hoping for is a system that can crawl and answer queries
> at the same time, integrating more and more information as it becomes
> available, right?
>
Yes and no.
Yes, because the system needs to go through the index, get documents,
process the sentences and get all opinions, not necessarily about the
target.
No, because the queries aren't fixed. If you disregard the TREC queries, say
a person is sitting there asking for opinions about a target. He may type
"Nokia 6600" or "My left hand". Now, I would have to go through the DB and
find everything which talks about Nokia and the other, and do post
processing if it's not yet processed. Another reason is that the ranking of
the results becomes a problem. How do I say which among the 1000 results
gives the better opinion: the doc that talks more about the target, or the
one which has more opinions about the target? Neither; we need to rank them
based on the output of classification algorithms.

This is where I see the use of Mahout. Say we have the core Lucene
architecture modded with Mahout, so that I can give the results of a Mahout
classifier to Lucene's ranking function, based on subjectivity, polarity,
etc. Not only would it become easy to implement good IR systems for
research, it could also give rise to some real funky use cases for complex
production IR systems.
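
One simple way to picture ranking by classifier output is a linear blend of the retrieval score and a classifier score. The sketch below is purely illustrative: `rerank`, the `weight` parameter and the assumption that both scores lie in [0, 1] are made up for this example, and no Lucene or Mahout API is used.

```python
def rerank(hits, opinion_score, weight=0.5):
    """hits: list of (doc_id, retrieval_score) pairs, scores assumed in [0, 1].
    opinion_score: function doc_id -> classifier score, assumed in [0, 1].
    Returns (doc_id, blended_score) pairs sorted best first."""
    blended = [
        (doc_id, (1 - weight) * score + weight * opinion_score(doc_id))
        for doc_id, score in hits
    ]
    return sorted(blended, key=lambda pair: pair[1], reverse=True)
```

For example, with `weight=0.5` a document with retrieval score 0.6 and opinion score 0.9 outranks one with retrieval score 0.9 and opinion score 0.1, while `weight=0` reproduces the plain retrieval order.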
>
> > I would gladly answer any queries except results
>
> Hmm, so for this competition there is no sample dataset available to test
> the performance of the algorithms against? Sounds like there is no way to
> determine which of two competing solutions is better except making two
> submissions...
>
Well, throughout the year, competing researchers give one or two queries
and hand-made results, which are compiled and tested against each other.

> Isabel
>
>
> --
> The ideal voice for radio may be defined as showing no substance, no
> sex,no owner, and a message of importance for every housewife.
>         -- Harry V. Wade
>
>
>
>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
>  |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
>



-- 
Robin Anil
4th Year Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

--------------------------------------------------------------------------------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself

Re: Regarding Google Summer of Code Lucene Mahout Project

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Monday 24 March 2008, Robin Anil wrote:
> The Complement-Naive-Bayes-Classifier(coded up for this project) then run on 
> the retrieved document to do post processing.

The ideas presented in the slides look pretty interesting to me. Could you 
please provide some pointers to information in the Complement Naive Bayes 
Classifier? What were the reasons you chose this classifier?

> If its possible to have the classifier run along with Lucene and
> spit out sentences and add them to a field in real-time, It would
> essentially enable this system to be online and allow for real-time
> queries.

So what you are hoping for is a system that can crawl and answer queries at 
the same time, integrating more and more information as it becomes available, 
right?

> I would gladly answer any queries except results

Hmm, so for this competition there is no sample dataset available to test the 
performance of the algorithms against? Sounds like there is no way to 
determine which of two competing solutions is better except making two 
submissions...

Isabel

-- 
The ideal voice for radio may be defined as showing no substance, no sex,no 
owner, and a message of importance for every housewife.		-- Harry V. Wade
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Regarding Google Summer of Code Lucene Mahout Project

Posted by Robin Anil <ro...@gmail.com>.

Hi Isabel,
I used the C# platform to work on the project. I am attaching a presentation
which I used in my last thesis review. It doesn't contain any results at the
moment. The complete project is done in C# in a single application. The
indexed documents are searched for the keyword. The top 1000 documents are
retrieved using a modified Lucene search. The
Complement-Naive-Bayes-Classifier (coded up for this project) is then run on
the retrieved documents to do post processing. For each TREC query the
search takes a few hundred milliseconds. The document retrieval and
classifiers take another 100 ms per document (HTML parsing + tokenising +
classification). The classifier models are loaded in memory when the
application starts.
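
A back-of-the-envelope estimate from the figures above, assuming all 1000 retrieved documents are post-processed one after another for a single query and taking 300 ms as a stand-in for "a few hundred milliseconds" (that value and the function name are assumptions):

```python
def query_latency_seconds(search_ms, docs, per_doc_ms):
    """Total time for one query when every retrieved document is
    post-processed sequentially after the search."""
    return (search_ms + docs * per_doc_ms) / 1000.0

# 300 ms search (assumed) + 1000 documents * 100 ms each
print(query_latency_seconds(300, 1000, 100))
```

Under these assumptions the per-document work, about 100 seconds in total, dwarfs the search itself, which is why moving classification out of query time matters.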

Right now the classifiers work in the post processing stage after document
retrieval. If it's possible to have the classifier run along with Lucene and
spit out sentences and add them to a field in real time, it would
essentially enable this system to be online and allow for real-time queries.
This is primarily the reason why I am interested in working on this
project. Please go through the presentation. I would gladly answer any
queries except about results (:D which I cannot evaluate till the TREC runs
take place in September 2008).

Robin

Re: Regarding Google Summer of Code Lucene Mahout Project

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Monday 24 March 2008, Robin Anil wrote:
> I am currently working on my Btech Thesis which is to extract opinionated
> Sentences from Blogs which is also a part of Text Retrieval Conference TREC
> 2008  Blog Track under the guidance of Prof. Sudeshna Sarkar

Can you tell us a little more about your approach to this problem? I am 
interested in the exact setup you used to tackle this problem. Maybe some of 
your experiences can be reused to build a powerful demo application for 
Mahout code.

> Could you suggest me the things I should get
> comfortable with in implementing this as well as the detail you require in
> the proposal for implementation

In addition to what Grant already suggested you could have a look into the 
Mahout mail archives or into the discussions in JIRA to find out about what 
is going on in our project and which proposals have been discussed so far.

Isabel

-- 
Violence is a sword that has no handle -- you have to hold the blade.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>