Posted to user@mahout.apache.org by Gokhan Capan <gk...@gmail.com> on 2012/08/06 13:00:24 UTC

LDA Questions

Hi,

My question is about interpreting lda document-topics output.

I am using trunk.

I have a directory of documents, each of which is named by an integer, and
there are no sub-directories under the data directory.
The directory structure is as follows
$ ls /path/to/data/
   1
   2
   5
   ...

From those documents I want to detect topics, and output:
- topic - top terms
- document - top topics

To this end, I first run seqdirectory on the directory:
$ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1

Then I run seq2sparse to create tf vectors of documents:
$ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3
--namedVector

After creating vectors, I run cvb0_local on those tf-vectors:
$ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to
$LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0

And to interpret the results, I use mahout's vectordump utility:
$ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10
-sort true -p true

$ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary
$SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10
-sort true -p true

The resulting words file consists of #ofTopics lines.
I assume each line is in <topicID \t wordsVector> format, where a
wordsVector is a sorted vector whose elements are <word, score> pairs.

The resulting docs file on the other hand, consists of #ofDocuments lines.
I assume each line is in <documentID \t topicsVector> format, where a
topicsVector is a sorted vector whose elements are <topicID, probability>
pairs.
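
To sanity-check that reading, here is a small sketch (plain Python, not part
of Mahout) that parses such a dump line, under the assumption that vectordump
with -p true emits lines of the form key<TAB>{elem1:score1,elem2:score2,...};
the exact format may differ between Mahout versions:

```python
# Hypothetical parser for a vectordump text line of the form:
#   key \t {elem1:score1,elem2:score2,...}
def parse_dump_line(line):
    key, _, vec = line.rstrip("\n").partition("\t")
    body = vec.strip().lstrip("{").rstrip("}")
    pairs = []
    for item in body.split(","):
        elem, _, score = item.rpartition(":")
        pairs.append((elem, float(score)))
    return key, pairs

# Example: a docs line mapping a document to its top topics
key, topics = parse_dump_line("3\t{7:0.41,2:0.33,11:0.09}")
print(key, topics[0])  # 3 ('7', 0.41)
```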

The problem is that the documentID field does not match the original
document ids. This field is populated with zero-based, auto-incremented
indices.

I would like to ask whether I am missing something that would make
vectordump output the correct document ids, whether this is the normal
behavior when running LDA on a directory of documents, or whether I made a
mistake in one of these steps.

I suspect the issue is that seqdirectory assigns Text ids to documents,
while the CVB algorithm expects documents in another format, <IntWritable,
VectorWritable>. If this is the case, could you help me assign IntWritable
ids to documents while creating the vectors from them?
Or should I modify the o.a.m.text.SequenceFilesFromDirectory code to do so?

Thanks

-- 
Gokhan

Re: LDA Questions

Posted by Gokhan Capan <gk...@gmail.com>.
Hi Jake,

Today I submitted the diff. It is available at
https://issues.apache.org/jira/browse/MAHOUT-1051

Thanks for the advice.




-- 
Gokhan

Re: LDA Questions

Posted by Jake Mannix <ja...@gmail.com>.
Sounds great Gokhan!




-- 

  -jake

Re: LDA Questions

Posted by Gokhan Capan <gk...@gmail.com>.
Jake,

I converted the ids to integers with rowid, and then
modified InMemoryCollapsedVariationalBayes0.loadVectors() so that it
returns a SparseMatrix (instead of a SparseRowMatrix) whose row ids are the
keys from the <IntWritable, VectorWritable> tf vectors. I am not certain it
works, since the mapped integer ids (produced by rowid) are in the range
[0, #ofDocuments), but I believe it does.

Constructing a SparseMatrix requires RandomAccessSparseVector row vectors,
and the tf-vectors are sparse vectors, so I assumed that an incoming tf
vector itself (or its getDelegate(), if it is a NamedVector) can be cast to
RandomAccessSparseVector.
I will submit the diff tomorrow, so you can review and commit.

Thank you for your help.
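
For what it's worth, the difference between the two behaviors can be
sketched in a few lines (plain Python with hypothetical data, not Mahout
code): storing rows under their sequence-file keys preserves the docId
mapping, while appending them in read order discards it:

```python
# Hypothetical sketch: 'records' stands for (IntWritable key, vector) pairs
# read from a sequence file; keys come from rowid and need not arrive in order.
records = [(2, {"apple": 1.0}), (0, {"pear": 2.0}), (1, {"plum": 3.0})]

# Old behavior (SparseRowMatrix-style): keys ignored, rows appended in
# read order, so row index need not equal the original docId.
rows_in_order = [vec for _, vec in records]

# Fixed behavior (SparseMatrix-style): each row stored under its key,
# so row i is exactly document i.
rows_by_key = {}
for key, vec in records:
    rows_by_key[key] = vec

print(rows_in_order[0])  # whichever row happened to be read first
print(rows_by_key[0])    # genuinely document 0
```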





-- 
Gokhan

Re: LDA Questions

Posted by Jake Mannix <ja...@gmail.com>.
Hi Gokhan,

  This looks like a bug in the
InMemoryCollapsedVariationalBayes0.loadVectors() method - it takes the
SequenceFile<? extends Writable, VectorWritable> and ignores the keys,
assigning the rows in order into an in-memory Matrix.

  If you run "$MAHOUT_HOME/bin/mahout rowid -i <your tf-vector-path> -o
<output path>"
this converts Text keys into IntWritable keys (and leaves behind an index
file, a mapping
of Text -> IntWritable which tells you which int is assigned to which
original text key).
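
The effect of that key conversion can be sketched as follows (plain Python,
illustrative only - rowid itself writes its index as a sequence file, not a
dict):

```python
# Hypothetical sketch of what rowid does: assign each Text key a sequential
# int, and keep the int -> original-key mapping (the "index file") so that
# results can later be joined back to the original document names.
def assign_row_ids(text_keys):
    index = {}  # int id -> original Text key
    for i, key in enumerate(text_keys):
        index[i] = key
    return index

index = assign_row_ids(["1", "2", "5"])
print(index)  # {0: '1', 1: '2', 2: '5'}
```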

  Then what you'd want to do is modify
InMemoryCollapsedVariationalBayes0.loadVectors()
to actually use the keys which are given to it, instead of reassigning to
sequential
ids.  If you make this change, we'd love to have the diff - not too many
people use
the cvb0_local path (it's usually used for debugging and testing smaller
data sets to see that topics are converging properly), but getting it to
actually produce
document -> topic outputs which correlate with original docIds would be
very nice! :)




-- 

  -jake