Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/03/14 21:53:00 UTC
[jira] [Closed] (MADLIB-1201) Inconsistent lda output tables
[ https://issues.apache.org/jira/browse/MADLIB-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan closed MADLIB-1201.
-----------------------------------
Resolution: Fixed
> Inconsistent lda output tables
> ------------------------------
>
> Key: MADLIB-1201
> URL: https://issues.apache.org/jira/browse/MADLIB-1201
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Parallel Latent Dirichlet Allocation
> Reporter: Jingyi Mei
> Assignee: Jingyi Mei
> Priority: Major
> Fix For: v1.14
>
>
> We found an inconsistency in the LDA module between the outputs of lda_train and lda_get_word_topic_count.
> Repro Steps
> {code}
> DROP TABLE IF EXISTS documents;
> CREATE TABLE documents(docid INT4, contents TEXT);
> INSERT INTO documents VALUES
> (0, ' b a a c'),
> (1, ' d e f f f ');
> ALTER TABLE documents ADD COLUMN words TEXT[];
> UPDATE documents SET words = regexp_split_to_array(lower(contents), E'[\\s+\\.\\,]');
> DROP TABLE IF EXISTS my_training, my_training_vocabulary;
> SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', TRUE);
> DROP TABLE IF EXISTS my_model, my_outdata;
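> SELECT madlib.lda_train( 'my_training', -- data_table
> 'my_model', -- model_table
> 'my_outdata', -- output_data_table
> 7, -- voc_size
> 2, -- topic_num
> 1, -- iter_num
> 5, -- alpha
> 0.01 -- beta
> );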
> SELECT * FROM my_outdata ORDER BY docid;
> ```
> docid | wordcount | words | counts | topic_count | topic_assignment
> -------+-----------+-----------+-----------+-------------+------------------
> 0 | 5 | {2,1,0,3} | {1,2,1,1} | {2,3} | {0,1,1,1,0}
> 1 | 7 | {4,5,0,6} | {1,1,2,3} | {1,6} | {1,0,1,1,1,1,1}
> ```
> DROP TABLE IF EXISTS my_word_topic_count;
> SELECT madlib.lda_get_word_topic_count( 'my_model', 'my_word_topic_count');
> SELECT * FROM my_word_topic_count ORDER BY wordid;
> ```
> wordid | topic_count
> --------+-------------
> 0 | {1,2}
> 1 | {0,2}
> 2 | {1,0}
> 3 | {0,1}
> 4 | {1,0}
> 5 | {0,1}
> 6 | {0,3}
> (7 rows)
> ```
> {code}
> The my_outdata output indicates that wordid 3 is assigned only to topic 0 (its single token, in docid 0, has topic assignment 0), whereas my_word_topic_count reports topic_count {0,1} for wordid 3, i.e. assigned only to topic 1. The two outputs are inconsistent with each other.
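
The mismatch described above can be checked by hand from the two result sets. The following is a minimal sketch, not part of the original report, that re-derives per-word topic counts from the my_outdata rows and compares them with the my_word_topic_count rows. It assumes that topic_assignment holds one topic per token, in the order obtained by repeating words[i] counts[i] times; the names OUTDATA, REPORTED and word_topic_counts are hypothetical and exist only for this illustration.

```python
# Rows copied from the my_outdata listing: (docid, words, counts, topic_assignment).
OUTDATA = [
    (0, [2, 1, 0, 3], [1, 2, 1, 1], [0, 1, 1, 1, 0]),
    (1, [4, 5, 0, 6], [1, 1, 2, 3], [1, 0, 1, 1, 1, 1, 1]),
]

# Rows copied from the my_word_topic_count listing: wordid -> topic_count.
REPORTED = {
    0: [1, 2], 1: [0, 2], 2: [1, 0], 3: [0, 1],
    4: [1, 0], 5: [0, 1], 6: [0, 3],
}

NUM_TOPICS = 2  # topic_num passed to lda_train in the repro


def word_topic_counts(rows, num_topics):
    """Tally, per wordid, how many tokens were assigned to each topic.

    Assumption: topic_assignment is aligned with the token sequence produced
    by repeating words[i] exactly counts[i] times, in order.
    """
    tally = {}
    for _docid, words, counts, assignment in rows:
        tokens = [w for w, c in zip(words, counts) for _ in range(c)]
        if len(tokens) != len(assignment):
            raise ValueError("token count does not match topic_assignment length")
        for wordid, topic in zip(tokens, assignment):
            tally.setdefault(wordid, [0] * num_topics)[topic] += 1
    return tally


derived = word_topic_counts(OUTDATA, NUM_TOPICS)

# Collect every wordid whose derived counts differ from the reported ones.
mismatches = {
    wordid: (derived.get(wordid), REPORTED[wordid])
    for wordid in sorted(REPORTED)
    if derived.get(wordid) != REPORTED[wordid]
}

for wordid, (from_outdata, from_model) in mismatches.items():
    print(f"wordid {wordid}: my_outdata gives {from_outdata}, "
          f"my_word_topic_count gives {from_model}")
```

Under that alignment assumption, the tally derived from my_outdata puts wordid 3 in topic 0 only, matching the description above, and the same comparison also flags wordids 0, 4 and 5 as differing between the two tables.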
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)