Posted to user@mahout.apache.org by Benoit Mathieu <bm...@deezer.com> on 2013/03/04 13:00:15 UTC

LDA with custom vectors

Hi mahout users,

I'd like to run the Mahout Latent Dirichlet Allocation algorithm (mahout
cvb) on my own data. I have about 1M "documents" and a vocabulary of 30k
"terms". Documents are very sparse: each contains only about 100 terms.
I'd like to extract "topics" from them.

I have generated Mahout vectors from my data with a simple Java program,
using RandomAccessSparseVector.

I successfully launched the "mahout cvb" job with num_topics=200, but
the job seems very slow: 70 running map tasks took 10 min to process about
25,000 documents on my cluster.

So my questions are:
- Does this job require a specific Vector class for good performance?
- Is the LDA algorithm suitable for processing 1M docs with a dictionary of
30k terms?

Thanks for any insights.

++
benoit
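For scale, the throughput reported above implies a very long total runtime. A rough back-of-envelope (assuming the observed rate stays constant across the full corpus and across iterations, which it may not):

```shell
# Back-of-envelope from the numbers above: 70 mappers, ~25,000 docs in 10 min.
docs_done=25000; minutes=10
rate=$(( docs_done / minutes ))            # ~2,500 docs/min across the cluster
total_docs=1000000; iterations=10
min_per_iter=$(( total_docs / rate ))      # ~400 min for one full pass
total_min=$(( min_per_iter * iterations )) # ~4,000 min, i.e. roughly 66 hours
echo "$rate $min_per_iter $total_min"
```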

Re: LDA with custom vectors

Posted by Benoit Mathieu <bm...@deezer.com>.
Here is my command line:

mahout cvb --input user_model/vectors --output user_model/output
--num_topics 200 --num_terms 28892 --dictionary user_model/dictionary
--maxIter 10

benoit

2013/3/4 Jake Mannix <ja...@gmail.com>

> Can you send us your command line args? Is that for 1 iteration? That
> would be very, very slow

Re: LDA with custom vectors

Posted by Benoit Mathieu <bm...@deezer.com>.
My docterm_matrix is a single file of about 200 MB:

> hadoop fs -ls user_model/vectors
196572321 2013-03-01 17:09 user_model/vectors

To increase map-task parallelism, I added
"-Dmapreduce.input.fileinputformat.split.maxsize=2097152" to the command
line. This way, the map phase is split into 94 tasks.
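The task count follows from the file size and the split cap. A sketch of the arithmetic (assuming the number of splits is roughly the ceiling of size divided by split.maxsize, which matches the 94 tasks observed):

```shell
# Number of input splits ~= ceil(file_size / split_maxsize)
file_size=196572321   # bytes, from the fs -ls listing above
split_max=2097152     # 2 MiB cap set via split.maxsize
splits=$(( (file_size + split_max - 1) / split_max ))   # integer ceiling division
echo "$splits"
```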

2013/3/4 Andy Schlaikjer <an...@gmail.com>

> Benoit, could you also paste the output of `hadoop fs -ls
> /path/to/your/docterm_matrix/part-*`? CVB map-side parallelism benefits
> from an even distribution of doc-term vectors across your input part files.

Re: LDA with custom vectors

Posted by Andy Schlaikjer <an...@gmail.com>.
Benoit, could you also paste the output of `hadoop fs -ls
/path/to/your/docterm_matrix/part-*`? CVB map-side parallelism benefits
from an even distribution of doc-term vectors across your input part files.
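This is why a single large input file caps parallelism: at a typical 64 MiB block/split size, Benoit's ~196 MB vectors file yields only a handful of splits, versus 94 with the 2 MiB split.maxsize override. A sketch (assuming splits are roughly ceil(size / split_size); 64 MiB is an assumed default here):

```shell
file_size=196572321           # the single vectors file, in bytes
default_split=67108864        # 64 MiB, an assumed default block/split size
small_split=2097152           # the 2 MiB split.maxsize override
few=$(( (file_size + default_split - 1) / default_split ))
many=$(( (file_size + small_split - 1) / small_split ))
echo "$few vs $many map tasks"
```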


On Mon, Mar 4, 2013 at 8:34 AM, Jake Mannix <ja...@gmail.com> wrote:

> Can you send us your command line args? Is that for 1 iteration? That
> would be very, very slow

Re: LDA with custom vectors

Posted by Jake Mannix <ja...@gmail.com>.
Can you send us your command line args? Is that for 1 iteration? That
would be very, very slow

On Monday, March 4, 2013, Benoit Mathieu wrote:


-- 

  -jake