Posted to issues@madlib.apache.org by "Nikhil (JIRA)" <ji...@apache.org> on 2018/01/27 01:21:00 UTC

[jira] [Comment Edited] (MADLIB-1201) Inconsistent lda output tables

    [ https://issues.apache.org/jira/browse/MADLIB-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341818#comment-16341818 ] 

Nikhil edited comment on MADLIB-1201 at 1/27/18 1:20 AM:
---------------------------------------------------------

We found the root cause of the inconsistency: the model is not updated after __lda_gibbs_sample is called with the output of __lda_count_topic_agg.
 
Let's assume that we run the lda_train function for only 1 iteration.

So for iteration 1:
In the lda.py_in:iteration function, two work tables are created, and we alternate between them as the work input and work output tables.
This is how we insert data into work_table_0:
{code}
            INSERT INTO {work_table}
            SELECT
                docid, wordcount, words, counts,
                {schema_madlib}.__lda_random_assign(wordcount, {topic_num}) AS topics
            FROM {data_table}
{code}

This is how we insert data into work_table_1
{code}
                INSERT INTO {work_table_out}
                SELECT
                    docid, wordcount, words, counts,
                    {schema_madlib}.__lda_gibbs_sample(
                        words, counts, doc_topic,
                        (SELECT model FROM {model_table}),
                        {alpha}, {beta}, {voc_size}, {topic_num}, 1)
                FROM
                    {work_table_in}
{code}
where work_table_out = work_table_1, work_table_in = work_table_0, and model_table holds the output of __lda_count_topic_agg.

Workflow

1. lda_train calls iteration, which sets up work_table_0.
2. iteration calls __lda_count_topic_agg, which takes part of the 'doc_topic' column from work_table_0 as input. That slice is passed to lda.cpp:lda_count_topic_sfunc as 'topic_assignment'.
3. The output of __lda_count_topic_agg is saved as the model.
4. We call __lda_gibbs_sample with the newly created model and insert the result into work_table_1. This is the final output, but the model is never updated with the output of __lda_gibbs_sample.
5. `work_table_0` is used to create the table `my_word_topic_count`, and `work_table_1` is used to create the table `my_outdata`.

{code}
select * from __work_table_train_0__ ;
 docid | wordcount |   words   |  counts   |      doc_topic
-------+-----------+-----------+-----------+---------------------
     0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3,0,1,1,0,1}
     1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6,0,1,1,1,1,1,1}
(2 rows)

select * from __work_table_train_1__ ;
 docid | wordcount |   words   |  counts   |      doc_topic
-------+-----------+-----------+-----------+---------------------
     0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3,0,1,1,1,0}
     1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6,1,0,1,1,1,1,1}
(2 rows)
{code}

Since the two tables have different topic assignments, we see the inconsistency between `my_word_topic_count` and `my_outdata`.
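
The mismatch can be reproduced offline from the two work tables shown above. The following is only a sketch, not MADlib code: the helper names are hypothetical, and it assumes that the first topic_num entries of doc_topic are per-topic counts, that the remaining entries are per-token topic assignments, and that tokens are laid out in the order of 'words' with each word repeated 'counts' times.

```python
from collections import defaultdict

TOPIC_NUM = 2  # matches the topic_num used in the repro steps


def split_doc_topic(doc_topic, topic_num=TOPIC_NUM):
    """Split doc_topic into (per-topic counts, per-token assignments).

    Assumed layout; mirrors the SQL slice doc_topic[topic_num + 1:...],
    which is 1-indexed, whereas Python lists are 0-indexed.
    """
    return doc_topic[:topic_num], doc_topic[topic_num:]


def word_topic_counts(rows, topic_num=TOPIC_NUM):
    """Aggregate per-token assignments into {wordid: [count per topic]}."""
    totals = defaultdict(lambda: [0] * topic_num)
    for words, counts, doc_topic in rows:
        _, assignment = split_doc_topic(doc_topic, topic_num)
        pos = 0
        for word, count in zip(words, counts):
            # Each word owns `count` consecutive token slots (assumption).
            for topic in assignment[pos:pos + count]:
                totals[word][topic] += 1
            pos += count
    return {word: totals[word] for word in sorted(totals)}


# (words, counts, doc_topic) copied from the two work tables above.
work_table_0 = [
    ([2, 1, 0, 3], [1, 2, 1, 1], [2, 3, 0, 1, 1, 0, 1]),
    ([4, 5, 0, 6], [1, 1, 2, 3], [1, 6, 0, 1, 1, 1, 1, 1, 1]),
]
work_table_1 = [
    ([2, 1, 0, 3], [1, 2, 1, 1], [2, 3, 0, 1, 1, 1, 0]),
    ([4, 5, 0, 6], [1, 1, 2, 3], [1, 6, 1, 0, 1, 1, 1, 1, 1]),
]

model_counts = word_topic_counts(work_table_0)   # what the saved model reflects
output_counts = word_topic_counts(work_table_1)  # what the output table reflects

print("wordid 3 in model :", model_counts[3])
print("wordid 3 in output:", output_counts[3])
```

Under these assumptions, the counts derived from work_table_0 match the `my_word_topic_count` rows in the issue description, while the counts derived from work_table_1 put wordid 3 in the other topic.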
 
We tried to call __lda_count_topic_agg after lda_train() in our psql console and noticed that the model got updated and the results were consistent.
{code}
madlib=# create table my_model2 as select * from my_model ;

INSERT INTO my_model2
SELECT
    7,
    2,
    5,
    0.01,
    madlib.__lda_count_topic_agg(
        words,
        counts,
        doc_topic[2 + 1:array_upper(doc_topic, 1)],
        7,
        2
    ) AS model;
DROP TABLE IF EXISTS my_word_topic_count2;
SELECT madlib.lda_get_word_topic_count( 'my_model2', 'my_word_topic_count2');
SELECT * FROM my_word_topic_count2 ORDER BY wordid;
 wordid | topic_count
--------+-------------
      0 | {3,0}
      1 | {0,2}
      2 | {1,0}
      3 | {1,0}
      4 | {0,1}
      5 | {1,0}
      6 | {0,3}
(7 rows)
select * from my_outdata ;
 docid | wordcount |   words   |  counts   | topic_count | topic_assignment
-------+-----------+-----------+-----------+-------------+------------------
     0 |         5 | {2,1,0,3} | {1,2,1,1} | {3,2}       | {0,1,1,0,0}
     1 |         7 | {4,5,0,6} | {1,1,2,3} | {3,4}       | {1,0,0,0,1,1,1}
(2 rows)
{code}
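
As a sanity check on the console output above, the per-token assignments in `my_outdata` can be aggregated by hand and compared with `my_word_topic_count2`. This is a small sketch with a hypothetical helper name; it assumes topic_assignment lists one topic per token, in the order of 'words' with each word repeated 'counts' times.

```python
def counts_from_outdata(rows, topic_num):
    """Build {wordid: [count per topic]} from (words, counts, topic_assignment) rows."""
    totals = {}
    for words, counts, assignment in rows:
        # Expand the bag of words to one entry per token (assumed ordering).
        tokens = [word for word, count in zip(words, counts) for _ in range(count)]
        for word, topic in zip(tokens, assignment):
            totals.setdefault(word, [0] * topic_num)[topic] += 1
    return totals


# Rows copied from the my_outdata output above.
my_outdata = [
    ([2, 1, 0, 3], [1, 2, 1, 1], [0, 1, 1, 0, 0]),
    ([4, 5, 0, 6], [1, 1, 2, 3], [1, 0, 0, 0, 1, 1, 1]),
]

# Rows copied from the my_word_topic_count2 output above.
my_word_topic_count2 = {
    0: [3, 0], 1: [0, 2], 2: [1, 0], 3: [1, 0],
    4: [0, 1], 5: [1, 0], 6: [0, 3],
}

recomputed = counts_from_outdata(my_outdata, topic_num=2)
consistent = recomputed == my_word_topic_count2
print("consistent:", consistent)
```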


was (Author: nikhilkak):
We found out the root cause behind the inconsistency but are not sure about the motivation behind the logic.
 
Let's assume that we run the lda_train function for only 1 iteration.

So for iteration 1:
In the lda.py_in:iteration function, two work tables are created, and we alternate between them as the work input and work output tables.
This is how we insert data into work_table_0:
{code}
            INSERT INTO {work_table}
            SELECT
                docid, wordcount, words, counts,
                {schema_madlib}.__lda_random_assign(wordcount, {topic_num}) AS topics
            FROM {data_table}
{code}

This is how we insert data into work_table_1
{code}
                INSERT INTO {work_table_out}
                SELECT
                    docid, wordcount, words, counts,
                    {schema_madlib}.__lda_gibbs_sample(
                        words, counts, doc_topic,
                        (SELECT model FROM {model_table}),
                        {alpha}, {beta}, {voc_size}, {topic_num}, 1)
                FROM
                    {work_table_in}
{code}
where  work_table_out=work_table_1, work_table_in=work_table_0, model_table=output of __lda_count_topic_agg

Workflow

1. lda_train calls iteration which sets up work_table_0.
2. iteration calls __lda_count_topic_agg which takes as input part of the column 'doc_topic'  from work_table_0. This is passed to lda.cpp:lda_count_topic_sfunc as  'topic_assignment'.
3. The output of __lda_count_topic_agg is saved as the model. 
4. We call __lda_gibbs_sample on the newly created model and insert data into work_table_1.
5. `work_table_0` is used to create the table `my_word_topic_count` and `work_table_1` is used to create the table `my_outdata`

{code}
madlib=# select * from __work_table_train_0__ ;
 docid | wordcount |   words   |  counts   |      doc_topic
-------+-----------+-----------+-----------+---------------------
     0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3,0,1,1,0,1}
     1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6,0,1,1,1,1,1,1}
(2 rows)

madlib=# select * from __work_table_train_1__ ;
 docid | wordcount |   words   |  counts   |      doc_topic
-------+-----------+-----------+-----------+---------------------
     0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3,0,1,1,1,0}
     1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6,1,0,1,1,1,1,1}
(2 rows)
{code}

Since the two tables have different topic assignments, we see the inconsistency between `my_word_topic_count` and `my_outdata`.
 


> Inconsistent lda output tables
> ------------------------------
>
>                 Key: MADLIB-1201
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1201
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Parallel Latent Dirichlet Allocation
>            Reporter: Jingyi Mei
>            Priority: Major
>             Fix For: 1.14
>
>
> We found an inconsistency in the LDA module between the outputs of lda_train and lda_get_word_topic_count. 
> Repro Steps
> {code}
> DROP TABLE IF EXISTS documents;
> CREATE TABLE documents(docid INT4, contents TEXT);
> INSERT INTO documents VALUES
> (0, ' b a a c'),
> (1, ' d e f f f ');
> ALTER TABLE documents ADD COLUMN words TEXT[];
> UPDATE documents SET words = regexp_split_to_array(lower(contents), E'[\\s+\\.\\,]');
> DROP TABLE IF EXISTS my_training, my_training_vocabulary;
> SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', TRUE);
> DROP TABLE IF EXISTS my_model, my_outdata;
> SELECT madlib.lda_train( 'my_training',
>                          'my_model',
>                          'my_outdata',
>                          7,
>                          2,
>                          1,
>                          5,
>                          0.01
>                        );
> select * from my_outdata order by docid;
> ```
>  docid | wordcount |   words   |  counts   | topic_count | topic_assignment
> -------+-----------+-----------+-----------+-------------+------------------
>      0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3}       | {0,1,1,1,0}
>      1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6}       | {1,0,1,1,1,1,1}
> ```
> DROP TABLE IF EXISTS my_word_topic_count;
> SELECT madlib.lda_get_word_topic_count( 'my_model', 'my_word_topic_count');
> SELECT * FROM my_word_topic_count ORDER BY wordid;
> ```
>  wordid | topic_count
> --------+-------------
>       0 | {1,2}
>       1 | {0,2}
>       2 | {1,0}
>       3 | {0,1}
>       4 | {1,0}
>       5 | {0,1}
>       6 | {0,3}
> (7 rows)
> ```
> {code}
> The output of 'my_outdata' indicates that wordid 3 gets assigned only to topic 0, but the output of my_word_topic_count indicates that wordid 3 gets assigned only to topic 1. These outputs are inconsistent with each other.


