
Posted to user@mahout.apache.org by Mohammed Omer <be...@gmail.com> on 2014/07/13 20:06:55 UTC

CVB: Incorrect mapping between p(topic | term) and p(doc | topic) dump files

All - I'm seeing the same issue described at
http://comments.gmane.org/gmane.comp.apache.mahout.user/18889 on Mahout
0.9. My CVB clusters describe my corpus well; however, the mapping file
generated by Mahout's `rowid` seems to be way off.

For example, there's a very obvious cluster which has keywords like "beer,
stout, pale" - the only cluster to contain these keywords. In my vectordump
of p(term | topic), this cluster is at line 217. The vector dump was
generated by:

echo `date` ": Dumping the p(term | topic) vectors to local filesystem..."
$mahout_bin/mahout vectordump -i results/cvb_results/to_out \
  --dictionary results/seq2sparse_results/dictionary.file-0 \
  --vectorSize $NUM_KEYWORDS -sort results/cvb_results/to_out \
  -o $OUTPUT_DIR/$PTOPIC_TERM_FILE -dt sequencefile

The p(doc | topic) dump does group all of the documents which contain the
words "beer, stout, pale" together, but it puts them into cluster number 8.
That dump is created via:

echo `date` ": Dumping the p(doc | topic) vectors to local filesystem..."
$mahout_bin/mahout vectordump -i results/cvb_results/do_out \
  -sort results/cvb_results/do_out \
  -o $OUTPUT_DIR/$PDOC_TOPIC_FILE -p true -c csv -n true -u true

I.e., the p(doc | topic) dump produces lines like:

123    0.001,...,0.60,...

where 123 maps to a document about "beer, stout, pale" and 0.60 is the 9th
comma-separated value, so the document belongs to cluster id #8
(zero-indexed).
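
For anyone reproducing the lookup described above, here is a minimal Python
sketch of how a cluster id can be read off one row of the CSV dump. It
assumes each row is a document key, then whitespace, then comma-separated
probabilities; the function name `dominant_topic` and the sample row are
made up for illustration and are not taken from the actual dump.

```python
def dominant_topic(line):
    """Return (doc_key, zero-based index of the largest probability)."""
    # Assumed row layout: "<key><whitespace><p0>,<p1>,...".
    key, values = line.split(None, 1)
    probs = [float(v) for v in values.strip().split(",")]
    # Index of the maximum value; ties resolve to the first occurrence.
    best = max(range(len(probs)), key=probs.__getitem__)
    return key, best


# Hypothetical row: ten topics, largest value in the 9th position.
row = "123\t0.001,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.60,0.049"
print(dominant_topic(row))  # ('123', 8)
```

The 9th value wins, so the zero-based cluster id is 8, which matches the
counting used in this post.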

However, if we look at the p(term | topic) file dumped earlier, cluster
id#8 has nothing to do with this document.

Additionally, I wrote a script to review all of the documents belonging to
any given cluster, and all of the documents in cluster #8 actually match
the p(term | topic) entry for cluster #217. That is to say, these are the
only documents containing the n-grams / keywords that cluster #217 lists
as describing it.

I can't figure out where the gap is: is it in the rowid docIndex/matrix I
have? I've tried dumping the above two files without sorting, since I
figured sorting might be rearranging the order of the cluster probabilities
in the p(doc | topic) dump, but the results were inconclusive, I believe.

I would love any ideas - I've been stumped on this for a little while now.

Thank you,

Mo

Re: CVB: Incorrect mapping between p(topic | term) and p(doc | topic) dump files

Posted by Mohammed Omer <be...@gmail.com>.

All, I must have missed a param earlier: it seems that the command below
produces an export that includes the keys. Derp.

mahout vectordump -i results/cvb_results/to_out \
  --dictionary results/seq2sparse_results/dictionary.file-0 \
  --vectorSize $NUM_KEYWORDS -sort true \
  -o $OUTPUT_DIR/$PTOPIC_TERM_FILE -dt sequencefile -n true -u true -p true

Re: CVB: Incorrect mapping between p(topic | term) and p(doc | topic) dump files

Posted by Mohammed Omer <be...@gmail.com>.

Quick, brief update to all who are looking into this:

It's become apparent that, because `vectordump` can't include a given
topic's ID when it is used with a dictionary file, I'll likely have to
resort to using `seqdumper` to dump out the term|topic vectors, use
`seqdumper` again to dump out the dictionary file, and finally write my own
map job to join the two together.
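
For readers who want to try the join locally, here is a rough Python sketch.
It swaps in a plain in-memory join instead of a map job, and it assumes the
`seqdumper` text output has lines of the form `Key: <key>: Value: <value>`,
with dictionary values being integer term indexes and topic values printed
as `{index:weight,...}`. Those formats, the function names, and the sample
lines are assumptions for illustration; check them against your own dumps.

```python
import re

# Assumed seqdumper line shape: "Key: <key>: Value: <value>".
LINE = re.compile(r"^Key: (.*?): Value: (.*)$")


def parse_dictionary(lines):
    """Map term index -> term from lines like 'Key: beer: Value: 17'."""
    index_to_term = {}
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            index_to_term[int(m.group(2))] = m.group(1)
    return index_to_term


def parse_topics(lines):
    """Map topic id -> {term index: weight} from 'Key: 8: Value: {17:0.6}'."""
    topics = {}
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip header/footer lines that are not key/value rows
        body = m.group(2).strip().strip("{}")
        weights = {}
        for pair in body.split(","):
            if pair:
                idx, weight = pair.split(":")
                weights[int(idx)] = float(weight)
        topics[int(m.group(1))] = weights
    return topics


def top_terms(topics, index_to_term, n=3):
    """Join both dumps: topic id -> top-n (term, weight) pairs by weight."""
    result = {}
    for topic_id, weights in topics.items():
        ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
        result[topic_id] = [(index_to_term.get(i, str(i)), w) for i, w in ranked]
    return result


# Hypothetical dump lines, not real output.
dictionary_dump = ["Key: beer: Value: 0", "Key: pale: Value: 1",
                   "Key: stout: Value: 2", "Key: wine: Value: 3"]
topic_dump = ["Key: 8: Value: {0:0.5,1:0.2,2:0.25,3:0.05}"]
print(top_terms(parse_topics(topic_dump), parse_dictionary(dictionary_dump)))
```

Because the topic id comes from the sequence file key rather than from the
line position in the dump, the result can be compared directly with the
column index in the p(doc | topic) CSV.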

Issue resolved; I'll write a post on this in detail for others to learn
from and reference. If anyone comes up with a more streamlined solution,
I'll still donate the full $200 to Apache; otherwise, I'll throw in $100
next week.

Thank you all for your work on Mahout.

Mo


Re: CVB: Incorrect mapping between p(topic | term) and p(doc | topic) dump files

Posted by Mohammed Omer <be...@gmail.com>.

All - to help illustrate the issue, I've put together my mahout cvb script
and some truncated output files here for your review with real data:

https://gist.github.com/momer/3ddaaa0c291a91d25709

Not sure if this is frowned upon, but to get some eyes on this issue
sooner, I'll donate $200 to the Apache foundation if we can figure this out
by the end of the week, and $100 if we can figure it out by the end of next
week!

Thank you,

Mo

