Posted to user@mahout.apache.org by Cosmin Dumbrava <of...@gmail.com> on 2014/03/07 00:12:25 UTC

Reuters Example LDA Error (no help anywhere)

I don't know if it is OK to mail this address like this, but... there is

I executed cluster-reuters.sh from the example directory (version 1.0
SNAPSHOT), and at the end I only get a list of
.....
21575    {0.02:0.6314297270431626,0.03:0.12547216143460152,0.007050:0.08061044448337305,0.04:0.07121802301642256,0.025:0.0677648308012434,0.003:0.0221466872297289,0.06:4.4720109631453837E-4,0.01:4.0331445050718065E-4,0.077:1.0509017796402916E-4,0.1:6.868649426131684E-5}
21576    {0.055:0.7123345754234253,0.003:0.10345316403842542,0.025:0.07850931669910466,0.1:0.0688641506163345,0.06:0.010599081492449824,0:0.0081953368778766,0.04:0.00469907695241742,0.03:0.003966985061879055,0.07:0.002197060890631658,0.0625:0.0020741956232281466}
21577    {0.04:0.5277733526037044,0.01:0.46656672162804314,0.07:0.0024295914763474164,0.1:0.002243674469679058,0.077:8.012577174900807E-4,0.007050:3.9184997476998896E-5,0.03:3.2141106779800255E-5,0.0625:2.4665616652494003E-5,0.02:1.949377177063371E-5,0.025:1.3329985998932362E-5}
....

$MAHOUT cvb \
    -i ${WORK_DIR}/reuters-out-matrix/matrix \
    -o ${WORK_DIR}/reuters-lda -k 20 -ow -x 20 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-lda/dictionary.file-* \
    -dt ${WORK_DIR}/reuters-lda-topics \
    -mt ${WORK_DIR}/reuters-lda-model \
  && \
  $MAHOUT vectordump \
    -i ${WORK_DIR}/reuters-lda-topics/part-m-00000 \
    -o ${WORK_DIR}/reuters-lda/vectordump \
    -vs 10 -p true \
    -d ${WORK_DIR}/reuters-out-seqdir-sparse-lda/dictionary.file-* \
    -dt sequencefile -sort ${WORK_DIR}/reuters-lda-topics/part-m-00000 \
    && \

Do I need to do something more to get usable output from this?

The same thing happens when I try to implement it on my own.


Thanks in advance

Re: Reuters Example LDA Error (no help anywhere)

Posted by Andrew Musselman <an...@gmail.com>.
Filed a ticket here:  https://issues.apache.org/jira/browse/MAHOUT-1470


On Thu, Mar 6, 2014 at 4:36 PM, Cosinus WebDev <of...@gmail.com> wrote:

> Hi,
>
> Thank you for the answer, now I can rest a second :)
>
> Hope this will be fixed soon. If you file a JIRA please send me the link so
> I can watch the result.
>
> Thank you again,
>
> And one or two more questions:
> 1. Does vectordumping the cvb result (/work/out/cvb) show the terms in each topic?
> 2. Should the topics directory (/work/out/topics) contain the "best" terms
> from all topics?
>
> bin/mahout cvb \
> -i /work/matrix \
> -o /work/out/cvb -k 100 -ow -x 20 \
> -dt /work/out/topics \
> ....

Re: Reuters Example LDA Error (no help anywhere)

Posted by Cosinus WebDev <of...@gmail.com>.
Hi,

Thank you for the answer, now I can rest a second :)

Hope this will be fixed soon. If you file a JIRA please send me the link so
I can watch the result.

Thank you again,

And one or two more questions:
1. Does vectordumping the cvb result (/work/out/cvb) show the terms in each topic?
2. Should the topics directory (/work/out/topics) contain the "best" terms
from all topics?

bin/mahout cvb \
-i /work/matrix \
-o /work/out/cvb -k 100 -ow -x 20 \
-dt /work/out/topics \
....




On Fri, Mar 7, 2014 at 2:07 AM, Suneel Marthi <su...@yahoo.com> wrote:

> Typo in previous email, read as:
>
> "Ideally Mahout's missing a clusterdump like utility for that reads in LDA
> topics, Document - DocumentId mapping and displays a report of the
> topics and the documents that belong to a topic."

Re: Reuters Example LDA Error (no help anywhere)

Posted by Suneel Marthi <su...@yahoo.com>.
Typo in previous email, read as:

"Ideally Mahout's missing a clusterdump like utility for that reads in LDA 
topics, Document - DocumentId mapping and displays a report of the 
topics and the documents that belong to a topic."




On Thursday, March 6, 2014 7:06 PM, Suneel Marthi <su...@yahoo.com> wrote:
 
The script needs to be corrected to not call vectordump for LDA, as the vectordump utility (and even clusterdump) is presently not capable of displaying topics and relevant documents. I recall this issue was previously reported by Peyman Faratin after the 0.9 release.

Ideally Mahout's missing a clusterdump utility for that reads in LDA topics, Document - DocumentId mapping and displays a report of the topics and the documents that belong to a topic.

Meanwhile in order to see the generated topics and documents please refer to this blog: http://sujitpal.blogspot.com/2013/10/topic-modeling-with-mahout-on-amazon-emr.html

Let me file a JIRA for this.







Re: Reuters Example LDA Error (no help anywhere)

Posted by Suneel Marthi <su...@yahoo.com>.
The script needs to be corrected to not call vectordump for LDA, as the vectordump utility (and even clusterdump) is presently not capable of displaying topics and relevant documents. I recall this issue was previously reported by Peyman Faratin after the 0.9 release.

Ideally Mahout's missing a clusterdump utility for that reads in LDA topics, Document - DocumentId mapping and displays a report of the topics and the documents that belong to a topic.

Meanwhile in order to see the generated topics and documents please refer to this blog: http://sujitpal.blogspot.com/2013/10/topic-modeling-with-mahout-on-amazon-emr.html

Let me file a JIRA for this.
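
A rough stand-in for the report described above can be built from the vectordump text file. The sketch below is not part of Mahout; it assumes each dump line holds a document id followed by one {label:weight,...} vector on the same line, and it lists the document ids under whichever label carries the largest weight. The function name top_label_per_doc is made up for illustration, and the labels are simply whatever vectordump printed, so they may not be meaningful topic names.

```shell
#!/bin/sh
# Sketch only: group document ids by the highest-weight entry of each dump line.
# Assumed input format (one document per line):
#   <docId> {label:weight,label:weight,...}
# top_label_per_doc is a hypothetical helper, not a Mahout command.
top_label_per_doc() {
  awk '
    NF >= 2 {
      doc = $1
      vec = $2
      gsub(/[{}]/, "", vec)              # drop the surrounding braces
      n = split(vec, pairs, ",")         # one "label:weight" string per entry
      best = ""
      bestw = -1
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, ":")
        w = kv[2] + 0                    # numeric value, handles 4.4E-4 style
        if (w > bestw) {
          bestw = w
          best = kv[1]
        }
      }
      if (best in docs) {
        docs[best] = docs[best] " " doc  # append to the ids already collected
      } else {
        docs[best] = doc
      }
    }
    END {
      for (label in docs) {
        print label ": " docs[label]
      }
    }
  ' "$@" | sort
}
```

Called with no arguments the function reads standard input; otherwise pass it the dump file, for example the path given to vectordump with -o in the script.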





On Thursday, March 6, 2014 6:12 PM, Cosmin Dumbrava <of...@gmail.com> wrote:
 
I don't know if it is OK to mail this address like this, but... there is

I executed cluster-reuters.sh from the example directory (version 1.0
SNAPSHOT), and at the end I only get a list of
.....
21575    {0.02:0.6314297270431626,0.03:0.12547216143460152,0.007050:0.08061044448337305,0.04:0.07121802301642256,0.025:0.0677648308012434,0.003:0.0221466872297289,0.06:4.4720109631453837E-4,0.01:4.0331445050718065E-4,0.077:1.0509017796402916E-4,0.1:6.868649426131684E-5}
21576    {0.055:0.7123345754234253,0.003:0.10345316403842542,0.025:0.07850931669910466,0.1:0.0688641506163345,0.06:0.010599081492449824,0:0.0081953368778766,0.04:0.00469907695241742,0.03:0.003966985061879055,0.07:0.002197060890631658,0.0625:0.0020741956232281466}
21577    {0.04:0.5277733526037044,0.01:0.46656672162804314,0.07:0.0024295914763474164,0.1:0.002243674469679058,0.077:8.012577174900807E-4,0.007050:3.9184997476998896E-5,0.03:3.2141106779800255E-5,0.0625:2.4665616652494003E-5,0.02:1.949377177063371E-5,0.025:1.3329985998932362E-5}
....

$MAHOUT cvb \
    -i ${WORK_DIR}/reuters-out-matrix/matrix \
    -o ${WORK_DIR}/reuters-lda -k 20 -ow -x 20 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-lda/dictionary.file-* \
    -dt ${WORK_DIR}/reuters-lda-topics \
    -mt ${WORK_DIR}/reuters-lda-model \
  && \
  $MAHOUT vectordump \
    -i ${WORK_DIR}/reuters-lda-topics/part-m-00000 \
    -o ${WORK_DIR}/reuters-lda/vectordump \
    -vs 10 -p true \
    -d ${WORK_DIR}/reuters-out-seqdir-sparse-lda/dictionary.file-* \
    -dt sequencefile -sort ${WORK_DIR}/reuters-lda-topics/part-m-00000 \
    && \

Do I need to do something more to get usable output from this?

The same thing happens when I try to implement it on my own.


Thanks in advance