Posted to user@mahout.apache.org by Cosmin Dumbrava <of...@gmail.com> on 2014/03/07 00:12:25 UTC
Reuters Example LDA Error (no help anywhere)
I don't know if it is OK to mail this address like this, but... there is
I have executed cluster-reuters.sh from the example directory (version 1.0
SNAPSHOT) and at the end I only get a list of
.....
21575 {0.02:0.6314297270431626,0.03:0.12547216143460152,0.007050:0.08061044448337305,0.04:0.07121802301642256,0.025:0.0677648308012434,0.003:0.0221466872297289,0.06:4.4720109631453837E-4,0.01:4.0331445050718065E-4,0.077:1.0509017796402916E-4,0.1:6.868649426131684E-5}
21576 {0.055:0.7123345754234253,0.003:0.10345316403842542,0.025:0.07850931669910466,0.1:0.0688641506163345,0.06:0.010599081492449824,0:0.0081953368778766,0.04:0.00469907695241742,0.03:0.003966985061879055,0.07:0.002197060890631658,0.0625:0.0020741956232281466}
21577 {0.04:0.5277733526037044,0.01:0.46656672162804314,0.07:0.0024295914763474164,0.1:0.002243674469679058,0.077:8.012577174900807E-4,0.007050:3.9184997476998896E-5,0.03:3.2141106779800255E-5,0.0625:2.4665616652494003E-5,0.02:1.949377177063371E-5,0.025:1.3329985998932362E-5}
....
$MAHOUT cvb \
-i ${WORK_DIR}/reuters-out-matrix/matrix \
-o ${WORK_DIR}/reuters-lda -k 20 -ow -x 20 \
-dict ${WORK_DIR}/reuters-out-seqdir-sparse-lda/dictionary.file-* \
-dt ${WORK_DIR}/reuters-lda-topics \
-mt ${WORK_DIR}/reuters-lda-model \
&& \
$MAHOUT vectordump \
-i ${WORK_DIR}/reuters-lda-topics/part-m-00000 \
-o ${WORK_DIR}/reuters-lda/vectordump \
-vs 10 -p true \
-d ${WORK_DIR}/reuters-out-seqdir-sparse-lda/dictionary.file-* \
-dt sequencefile -sort ${WORK_DIR}/reuters-lda-topics/part-m-00000 \
&& \
Must I do something more with the output from this point on?
The same thing happens when I try to implement it on my own.
Thanks in advance
Re: Reuters Example LDA Error (no help anywhere)
Posted by Andrew Musselman <an...@gmail.com>.
Filed a ticket here: https://issues.apache.org/jira/browse/MAHOUT-1470
Re: Reuters Example LDA Error (no help anywhere)
Posted by Cosinus WebDev <of...@gmail.com>.
Hi,
Thank you for the answer, now I can rest a second :)
Hope this will be fixed soon. If you file a JIRA please send me the link so
I can watch the result.
Thank you again,
And one or two more questions:
1. Does vectordumping the cvb result (/work/out/cvb) give the terms in each topic?
2. Should the topics directory (/work/out/topics) contain the "best" terms
from all topics?
bin/mahout cvb \
-i /work/matrix \
-o /work/out/cvb -k 100 -ow -x 20 \
-dt /work/out/topics \
....
Re: Reuters Example LDA Error (no help anywhere)
Posted by Suneel Marthi <su...@yahoo.com>.
Typo in previous email, read as:
"What Mahout is missing is a clusterdump-like utility that reads in the LDA
topics and the Document - DocumentId mapping and displays a report of the
topics and the documents that belong to each topic."
Re: Reuters Example LDA Error (no help anywhere)
Posted by Suneel Marthi <su...@yahoo.com>.
The script needs to be corrected so that it does not call vectordump for LDA, as the vectordump utility (or even clusterdump) is presently not capable of displaying topics and their relevant documents. I recall this issue was previously reported by Peyman Faratin after the 0.9 release.
What Mahout is missing is a clusterdump-like utility that reads in the LDA topics and the Document - DocumentId mapping and displays a report of the topics and the documents that belong to each topic.
Meanwhile, to see the generated topics and documents, please refer to this blog: http://sujitpal.blogspot.com/2013/10/topic-modeling-with-mahout-on-amazon-emr.html
Let me file a JIRA for this.
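Until such a utility exists, a rough per-document summary can be pulled out of the vectordump text output with awk. The following is only a sketch and not part of Mahout. It assumes each record sits on a single line in the form "docId {label:weight,...}", as in the output from the original post once mail-wrapped lines are rejoined. The function name top_label_per_doc is hypothetical. For each document it prints the label carrying the largest weight; whether those labels really stand for topics depends on how the dump was produced.

```shell
#!/usr/bin/env bash
# Sketch only: top_label_per_doc is a hypothetical helper, not a Mahout command.
# Input (files or stdin): one record per line, "<docId> {<label>:<weight>,...}".
# Output: "<docId><TAB><label with largest weight><TAB><weight to 4 decimals>".
top_label_per_doc() {
  awk '
    NF >= 2 {
      doc = $1
      vec = $2
      gsub(/[{}]/, "", vec)               # strip the surrounding braces
      n = split(vec, pairs, ",")          # one "label:weight" entry per element
      best = ""; bestw = -1
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, ":")
        w = kv[2] + 0                     # numeric value; handles forms like 4.47E-4
        if (w > bestw) { bestw = w; best = kv[1] }
      }
      printf "%s\t%s\t%.4f\n", doc, best, bestw
    }
  ' "$@"
}

# Usage (assumes a local copy of the text dump; the path is illustrative):
#   top_label_per_doc ./vectordump
```

For the record with key 21577 in the original post, this prints the label 0.04 with weight 0.5278. Lines that do not have both a key and a vector on them (such as the "...." separators) are skipped.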