Posted to user@mahout.apache.org by Manoj Kumar <ma...@gmail.com> on 2011/03/01 07:51:11 UTC

Re: LDA Mahout

Hi Jeff Eastman,
Are there any options to remove stop words while running LDA in
Mahout, or while creating sequence files from the corpus?
Kindly reply.

Thanks & Regards,
Manoj Kumar.R.K
Graduate Student, MS Computer Science
University at Buffalo
Buffalo, New York
(413) 461-8938|www.rkmanojkumar.co.nr



On Mon, Feb 28, 2011 at 1:06 PM, Manoj Kumar <ma...@gmail.com> wrote:

> Hi Jeff Eastman,
>
> Thanks a lot. I'll look into it and will contact you if I need any help.
>
> Thanks & Regards,
> Manoj Kumar.R.K
> Graduate Student, MS Computer Science
> University at Buffalo
> Buffalo, New York
> (413) 461-8938|www.rkmanojkumar.co.nr
>
>
>
> On Mon, Feb 28, 2011 at 12:48 PM, Jeff Eastman <je...@narus.com> wrote:
>
>> Look at examples/bin/build-reuters.sh for some examples. They are all from
>> the command line but illustrate the best way to do what you are attempting.
>> https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering also
>> has some example code for doing text processing.
>>
>> -----Original Message-----
>> From: Manoj Kumar [mailto:manoj1987@gmail.com]
>> Sent: Monday, February 28, 2011 9:28 AM
>> To: user@mahout.apache.org
>> Subject: Re: LDA Mahout
>>
>> Hi Jeff Eastman,
>> Thanks for your reply. I looked into the LDADriver class, but I am not sure
>> how to convert my text documents to sequence files and then to SparseVectors
>> to give as input to LDADriver. Can you please help me with this conversion?
>> Also, is it enough to just call the run method in the LDADriver class with
>> appropriate inputs to model the topics?
>>
>> Thanks & Regards,
>> Manoj Kumar.R.K
>> Graduate Student, MS Computer Science
>> University at Buffalo
>> Buffalo, New York
>> (413) 461-8938|www.rkmanojkumar.co.nr
>>
>>
>>
>> On Mon, Feb 28, 2011 at 12:23 PM, Jeff Eastman <je...@narus.com> wrote:
>>
>> > Have you looked at the Java classes that implement LDA? The private
>> > LDADriver.run() method should be made public, but this can be called from
>> > Java in Eclipse (if that is what you mean by "using Eclipse"). You could
>> > also look at the wiki for information on running LDA
>> > (https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation).
>> >
>> > -----Original Message-----
>> > From: Manoj Kumar [mailto:manoj1987@gmail.com]
>> > Sent: Monday, February 28, 2011 9:09 AM
>> > To: user@mahout.apache.org
>> > Subject: LDA Mahout
>> >
>> > Hi,
>> >
>> > I am doing a project which requires topic modeling of documents using LDA.
>> > I am planning to implement this using Mahout LDA. I have not been able to
>> > find any sample code for doing this from Eclipse; only command-line options
>> > were available. Kindly suggest a tutorial or provide some basic code for
>> > implementing LDA. Kindly reply.
>> >
>> > Thanks & Regards,
>> > Manoj Kumar.R.K
>> > Graduate Student, MS Computer Science
>> > University at Buffalo
>> > Buffalo, New York
>> > (413) 461-8938|www.rkmanojkumar.co.nr
>> >
>>
>
>

Re: LDA Mahout

Posted by Manoj Kumar <ma...@gmail.com>.
Thanks for the reply. I'll look into the Analyzer code.

Thanks & Regards,
Manoj Kumar.R.K
Graduate Student, MS Computer Science
University at Buffalo
Buffalo, New York
(413) 461-8938|www.rkmanojkumar.co.nr



On Tue, Mar 1, 2011 at 3:17 PM, Ted Dunning <te...@gmail.com> wrote:

> It should be very simple to do this.
>
> Manoj, what do you think of looking at the code and suggesting a patch?

Re: LDA Mahout

Posted by Ted Dunning <te...@gmail.com>.
It should be very simple to do this.

Manoj, what do you think of looking at the code and suggesting a patch?

On Tue, Mar 1, 2011 at 2:42 PM, Jeff Eastman <je...@narus.com> wrote:

> Not with seq2sparse. We do have some Lucene support which may allow this.
> Grant?

Re: LDA Mahout

Posted by Dmitriy Lyubimov <dl...@apache.org>.
PS: I know that from reading Mahout in Action :)

On Tue, Mar 1, 2011 at 2:55 PM, Dmitriy Lyubimov <dl...@apache.org> wrote:
> There's a way to specify a custom Lucene analyzer with one of the jobs;
> I think it is seq2sparse, which has an option for that. Naturally, if
> you use your own analyzer, you can write it with your custom stop
> word list (or perhaps there's an option to do that with Lucene's
> StopAnalyzer, or whatever it is called).
>
> -d

Re: LDA Mahout

Posted by Dmitriy Lyubimov <dl...@apache.org>.
There's a way to specify a custom Lucene analyzer with one of the jobs;
I think it is seq2sparse, which has an option for that. Naturally, if
you use your own analyzer, you can write it with your custom stop
word list (or perhaps there's an option to do that with Lucene's
StopAnalyzer, or whatever it is called).

-d
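
A minimal sketch of the custom stop word list idea, assuming one word per line in the text file and '#' for comment lines (both are assumptions for illustration). It uses only the JDK and is not the Lucene Analyzer API or anything Mahout ships; the class and method names are hypothetical. In a real analyzer the same filtering step would sit inside the token stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Mahout or Lucene.
final class StopWordFilterSketch {

    // Reads one stop word per line; blank lines and lines starting with '#' are skipped.
    static Set<String> loadStopWords(Reader source) {
        Set<String> stopWords = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty() && !word.startsWith("#")) {
                    stopWords.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stopWords;
    }

    // Lower-cases the text, splits on anything that is not a letter or digit,
    // and drops every token found in the stop word set.
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the stop word file in this demo.
        Reader source = new StringReader("# custom stop words\nthe\nof\nand\n");
        Set<String> stopWords = loadStopWords(source);
        System.out.println(tokenize("The topics of the corpus and the model", stopWords));
    }
}
```

To read the list from disk, the StringReader could be replaced with java.nio.file.Files.newBufferedReader on the path of the stop word file.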

On Tue, Mar 1, 2011 at 2:42 PM, Jeff Eastman <je...@narus.com> wrote:
> Not with seq2sparse. We do have some Lucene support which may allow this. Grant?

RE: LDA Mahout

Posted by Jeff Eastman <je...@Narus.com>.
Not with seq2sparse. We do have some Lucene support which may allow this. Grant?

-----Original Message-----
From: Manoj Kumar [mailto:manoj1987@gmail.com] 
Sent: Tuesday, March 01, 2011 12:26 PM
To: user@mahout.apache.org
Subject: Re: LDA Mahout

Thanks. But is it possible to provide a custom stop word list loaded
from a text file?


Re: LDA Mahout

Posted by Manoj Kumar <ma...@gmail.com>.
Thanks. But is it possible to provide a custom stop word list loaded
from a text file?

Thanks & Regards,
Manoj Kumar.R.K
Graduate Student, MS Computer Science
University at Buffalo
Buffalo, New York
(413) 461-8938|www.rkmanojkumar.co.nr



On Tue, Mar 1, 2011 at 11:36 AM, Jeff Eastman <je...@narus.com> wrote:

> Sure, seq2sparse has -maxDFPercent option which can be used to eliminate
> high frequency features like stop words. Check out the documentation at
> https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
> .

RE: LDA Mahout

Posted by Jeff Eastman <je...@Narus.com>.
Sure, seq2sparse has a -maxDFPercent option, which can be used to eliminate high-frequency features like stop words. Check out the documentation at https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text.
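
To make the idea behind a maximum document-frequency cutoff concrete, here is a minimal sketch in plain Java. It is not Mahout's implementation: the class and method names are hypothetical, and the strict "more than N percent of documents" comparison is an assumption, so check the seq2sparse documentation for the exact semantics of the real option.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of document-frequency pruning; not Mahout code.
final class MaxDfPruningSketch {

    // Returns the terms that occur in more than maxDfPercent percent of the documents.
    static Set<String> highFrequencyTerms(List<List<String>> docs, int maxDfPercent) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term once per document (document frequency, not term frequency).
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> tooFrequent = new HashSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            // df / numDocs > maxDfPercent / 100, rearranged to avoid floating point.
            if (e.getValue() * 100L > (long) maxDfPercent * docs.size()) {
                tooFrequent.add(e.getKey());
            }
        }
        return tooFrequent;
    }

    // Returns a copy of the documents with the high-frequency terms removed.
    static List<List<String>> prune(List<List<String>> docs, int maxDfPercent) {
        Set<String> drop = highFrequencyTerms(docs, maxDfPercent);
        List<List<String>> out = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String term : doc) {
                if (!drop.contains(term)) {
                    kept.add(term);
                }
            }
            out.add(kept);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("the", "topic", "model"),
            List.of("the", "hadoop", "job"),
            List.of("the", "lucene", "analyzer"),
            List.of("a", "topic", "mixture"));
        // "the" appears in 3 of 4 documents (75%), so a 70% cutoff removes it.
        System.out.println(prune(docs, 70));
    }
}
```

A frequency cutoff like this removes whatever happens to be common in the corpus, which is not the same as a curated stop word list: rare function words survive, and frequent domain terms may be dropped.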

-----Original Message-----
From: Manoj Kumar [mailto:manoj1987@gmail.com] 
Sent: Monday, February 28, 2011 10:51 PM
To: user@mahout.apache.org
Subject: Re: LDA Mahout

Hi Jeff Eastman,
Are there any options to remove stop words while running LDA in
Mahout, or while creating sequence files from the corpus?
Kindly reply.
