Posted to dev@madlib.apache.org by jingyimei <gi...@git.apache.org> on 2018/02/07 22:43:40 UTC

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

GitHub user jingyimei opened a pull request:

    https://github.com/apache/madlib/pull/232

    Multiple LDA improvements and fixes

    Co-author: Nikhil Kak (nkak@pivotal.io)
    
    This PR addresses the following issues:
    
    JIRAs
    MADLIB-1160
    MADLIB-1201
    
    1. Ensure that the output of lda_train is consistent with the output of lda_get_word_topic_count.
    2. Add a helper function that maps each wordid to the corresponding topicid assigned in the output table.
    3. Address the LDA topicid index inconsistency.
    4. Fix lda_get_topic_desc returning the wrong number of top_k words.
    
    All the commits are independent of each other and can be reviewed separately, which might be easier than reviewing the files.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib lda_output_fix_final

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/232.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #232
    
----
commit a99883dc60877974e1a651a48489a08ec66584a3
Author: Jingyi Mei and Nikhil Kak <jm...@...>
Date:   2018-01-30T21:58:11Z

    Fix lda output inconsistency bug and add install check test
    
    JIRA: MADLIB-1201
    
    Fixed the issue of the output of lda_train and lda_get_word_topic_count
    not matching each other.
    
    Also added an install check that validates that the output of lda_train and
    lda_get_word_topic_count are consistent with each other.
    See the JIRA for more details and an example.

commit f0664230153ebe254e4c98e51ebc41bc7faaf327
Author: Jingyi Mei <jm...@...>
Date:   2018-01-31T02:20:59Z

    LDA: Add helper function to map wordid and topicid
    
    JIRA: MADLIB-1160
    
    This commit adds a helper function that maps each wordid to the
    corresponding topicid assigned in the output table. Duplicate rows
    are removed from the final result.
    
    Also adds a workaround for GPDB 4.3 svec.
    
    In GPDB 4.3, we cannot call madlib.svec directly on a text
    format. Instead, we have to call madlib.svec_from_string to convert the
    text. This commit fixes this issue so the new helper function
    madlib.lda_get_word_topic_mapping can work on both GPDB 5 and GPDB 4.

commit a062acbf85d7044eaa37627a3904e456ab4aa309
Author: Jingyi Mei <jm...@...>
Date:   2018-01-31T20:21:10Z

    Address LDA topicid index inconsistency issue
    
    JIRA: MADLIB-1160
    
    This commit fixes the topicid inconsistency between madlib.lda_train
    and madlib.lda_get_topic_desc, where the former uses a 0-based index
    and the latter uses a 1-based index. Now both start at 0.

commit 7569049ba6bea5c4526db91478cbb165c79a2e60
Author: Jingyi Mei <jm...@...>
Date:   2018-01-31T20:32:19Z

    Fix LDA lda_get_topic_desc getting wrong top_k words issue
    
    JIRA: MADLIB-1160
    
    Previously, madlib.lda_get_topic_desc returned only the top k - 1 words
    in the result table. This commit fixes it to return the top k.

----
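
Editorial sketch of the word-to-topic mapping described in the second commit, written in plain Python rather than as the SQL helper itself. It assumes each output row carries the words, counts and topic_assignment arrays described in the lda_train documentation, and that topic_assignment holds one topic id per word occurrence, grouped in the same order as words. That layout and the function names below are assumptions made for illustration, not the MADlib implementation.

```python
from collections import Counter


def word_topic_counts(rows):
    """Count how often each (wordid, topicid) pair occurs across all rows.

    Each row is (words, counts, topic_assignment). Assumption: topic_assignment
    holds one topic id per word occurrence, grouped in the order of `words`.
    """
    totals = Counter()
    for words, counts, topic_assignment in rows:
        if sum(counts) != len(topic_assignment):
            raise ValueError("expected one topic id per word occurrence")
        pos = 0
        for wordid, n in zip(words, counts):
            # The next n entries of topic_assignment belong to this wordid.
            for topicid in topic_assignment[pos:pos + n]:
                totals[(wordid, topicid)] += 1
            pos += n
    return totals


def word_topic_mapping(rows):
    """Distinct (wordid, topicid) pairs, sorted, with duplicates removed."""
    return sorted(word_topic_counts(rows))
```

Aggregating the per-document assignments this way gives one set of word-topic counts that can be compared against any other view of the same model, which is the kind of consistency the new install check is meant to guard.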


---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by jingyimei <gi...@git.apache.org>.

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167717958
  
    --- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in ---
    @@ -74,175 +81,231 @@ tasks related to text.
         Flag to indicate if a vocabulary is to be created. If TRUE, an additional
         output table is created containing the vocabulary of all words, with an id
         assigned to each word. The table is called <em>output_table</em>_vocabulary
    -    (suffix added to the <em>output_table</em> name) and contains the
    +    (i.e., suffix added to the <em>output_table</em> name) and contains the
         following columns:
    -        - \c wordid: An id assignment for each word
    -        - \c word: The word/term
    +        - \c wordid: An id for each word.
    --- End diff --
    
    We can mention that the ids are assigned in alphabetical order.
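
Editorial sketch of what "ids in alphabetical order" means for a vocabulary table. It assumes ids start at 0 so that they form the contiguous range 0 to voc_size - 1 that lda_train expects. The function name build_vocabulary is hypothetical and this is plain Python, not the SQL used by the module.

```python
def build_vocabulary(tokenized_docs):
    """Return {word: wordid}, with ids 0..voc_size-1 assigned alphabetically."""
    # Collect the distinct words across all documents, then sort them.
    vocab = sorted({word for doc in tokenized_docs for word in doc})
    return {word: wordid for wordid, word in enumerate(vocab)}
```

Because the ids follow the sorted word list, the same set of documents always yields the same wordid for a given word.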


---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by jingyimei <gi...@git.apache.org>.

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167715245
  
    --- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in ---
    @@ -74,175 +81,231 @@ tasks related to text.
         Flag to indicate if a vocabulary is to be created. If TRUE, an additional
         output table is created containing the vocabulary of all words, with an id
         assigned to each word. The table is called <em>output_table</em>_vocabulary
    -    (suffix added to the <em>output_table</em> name) and contains the
    +    (i.e., suffix added to the <em>output_table</em> name) and contains the
         following columns:
    -        - \c wordid: An id assignment for each word
    -        - \c word: The word/term
    +        - \c wordid: An id for each word.
    +        - \c word: The word/term corresponding to the id.
         </dd>
     </dl>
     
     @anchor examples
     @par Examples
     
    --# Prepare datasets with some example documents
    +-# First we create a document table with one document per row:
     <pre class="example">
     DROP TABLE IF EXISTS documents;
    -CREATE TABLE documents(docid INTEGER, doc_contents TEXT);
    +CREATE TABLE documents(docid INT4, contents TEXT);
     INSERT INTO documents VALUES
    -(1, 'I like to eat broccoli and banana. I ate a banana and spinach smoothie for breakfast.'),
    -(2, 'Chinchillas and kittens are cute.'),
    -(3, 'My sister adopted two kittens yesterday'),
    -(4, 'Look at this cute hamster munching on a piece of broccoli');
    +(0, 'I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.'),
    +(1, 'Chinchillas and kittens are cute.'),
    +(2, 'My sister adopted two kittens yesterday.'),
    +(3, 'Look at this cute hamster munching on a piece of broccoli.');
     </pre>
    -
    --# Add a new column containing the words (lower-cased) in a text array
    +You can apply stemming, stop word removal and tokenization at this point 
    +in order to prepare the documents for text processing. 
    +Depending upon your database version, various tools are 
    +available. Databases based on more recent versions of 
    +PostgreSQL may do something like:
    +<pre class="example">
    +SELECT tsvector_to_array(to_tsvector('english',contents)) from documents;
    +</pre>
    +<pre class="result">
    +                    tsvector_to_array                     
    ++----------------------------------------------------------
    + {ate,banana,breakfast,broccoli,eat,like,smoothi,spinach}
    + {chinchilla,cute,kitten}
    + {adopt,kitten,sister,two,yesterday}
    + {broccoli,cute,hamster,look,munch,piec}
    +(4 rows)
    +</pre>
    +In this example, we assume a database based on an older 
    +version of PostgreSQL and just perform basic punctuation 
    +removal and tokenization. The array of words is added as 
    +a new column to the documents table:
     <pre class="example">
     ALTER TABLE documents ADD COLUMN words TEXT[];
    -UPDATE documents SET words = regexp_split_to_array(lower(doc_contents), E'[\\\\s+\\\\.]');
    +UPDATE documents SET words = 
    +    regexp_split_to_array(lower(
    +    regexp_replace(contents, E'[,.;\\']','', 'g')
    +    ), E'[\\\\s+]');
    +\\x on   
    +SELECT * FROM documents ORDER BY docid;
    +</pre>
    +<pre class="result">
    +-[ RECORD 1 ]------------------------------------------------------------------------------------
    +docid    | 0
    +contents | I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.
    +words    | {i,like,to,eat,broccoli,and,bananas,i,ate,a,banana,and,spinach,smoothie,for,breakfast}
    +-[ RECORD 2 ]------------------------------------------------------------------------------------
    +docid    | 1
    +contents | Chinchillas and kittens are cute.
    +words    | {chinchillas,and,kittens,are,cute}
    +-[ RECORD 3 ]------------------------------------------------------------------------------------
    +docid    | 2
    +contents | My sister adopted two kittens yesterday.
    +words    | {my,sister,adopted,two,kittens,yesterday}
    +-[ RECORD 4 ]------------------------------------------------------------------------------------
    +docid    | 3
    +contents | Look at this cute hamster munching on a piece of broccoli.
    +words    | {look,at,this,cute,hamster,munching,on,a,piece,of,broccoli}
     </pre>
     
    --# Compute the frequency of each word in each document
    +-# Compute the frequency of each word in each document:
     <pre class="example">
    -DROP TABLE IF EXISTS documents_tf;
    -SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf');
    -SELECT * FROM documents_tf order by docid;
    +DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
    +SELECT madlib.term_frequency('documents',    -- input table
    +                             'docid',        -- document id
    --- End diff --
    
    document id column
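
Editorial sketch of the tokenization and term-frequency steps in the quoted example, in plain Python. It strips the same punctuation characters as the quoted regexp_replace call and splits on whitespace. Python's split() collapses runs of whitespace, so it is not an exact equivalent of the quoted regexp_split_to_array pattern. The function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(contents):
    """Lower-case the text, strip the punctuation , . ; ' and split on whitespace."""
    cleaned = re.sub(r"[,.;']", "", contents)
    return cleaned.lower().split()


def term_frequency(documents):
    """Return sorted (docid, word, count) triples for a {docid: contents} dict."""
    rows = []
    for docid, contents in documents.items():
        for word, count in Counter(tokenize(contents)).items():
            rows.append((docid, word, count))
    return sorted(rows)
```

Each triple corresponds to one row of a term-frequency table keyed by the document id column.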


---

[GitHub] madlib issue #232: Multiple LDA improvements and fixes

Posted by fmcquillan99 <gi...@git.apache.org>.

Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/232
  
    Functional testing of these 4 commits seems fine to me.  I added comments and examples in:  
    MADLIB-1160
    MADLIB-1201
    
    Will create a PR for associated user doc changes shortly.


---

[GitHub] madlib issue #232: Multiple LDA improvements and fixes

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/232
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/345/



---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by jingyimei <gi...@git.apache.org>.

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167708065
  
    --- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
    @@ -182,324 +105,789 @@ lda_train( data_table,
     \b Arguments
     <dl class="arglist">
         <dt>data_table</dt>
    -    <dd>TEXT. The name of the table storing the training dataset. Each row is
    +    <dd>TEXT. Name of the table storing the training dataset. Each row is
         in the form <tt>&lt;docid, wordid, count&gt;</tt> where \c docid, \c wordid, and \c count
    -    are non-negative integers.
    -
    +    are non-negative integers.  
         The \c docid column refers to the document ID, the \c wordid column is the
         word ID (the index of a word in the vocabulary), and \c count is the
    -    number of occurrences of the word in the document.
    -
    -    Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
    -    exact names in the data_table.</dd>
    +    number of occurrences of the word in the document. Please note:
    +    
    +    - \c wordid must be 
    +    contiguous integers going from 0 to \c voc_size &minus; \c 1.
    +    - column names for \c docid, \c wordid, and \c count are currently fixed, 
    +    so you must use these exact names in the data_table.  
    +    
    +    The function <a href="group__grp__text__utilities.html">Term Frequency</a>
    +    can be used to generate vocabulary in the required format from raw documents.
    +    </dd>
     
         <dt>model_table</dt>
    -    <dd>TEXT. The name of the table storing the learned models. This table has one row and the following columns.
    +    <dd>TEXT. This is an output table generated by LDA which contains the learned model. 
    +    It has one row with the following columns:
             <table class="output">
                 <tr>
                     <th>voc_size</th>
    -                <td>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</td>
    +                <td>INTEGER. Size of the vocabulary. As mentioned above for the input 
    +                table, \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.  
    +                </td>
                 </tr>
                 <tr>
                     <th>topic_num</th>
                     <td>INTEGER. Number of topics.</td>
                 </tr>
                 <tr>
                     <th>alpha</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-document 
    +                topic multinomial.</td>
                 </tr>
                 <tr>
                     <th>beta</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +                word multinomial.</td>
                 </tr>
                 <tr>
                     <th>model</th>
    -                <td>BIGINT[].</td>
    +                <td>BIGINT[]. The encoded model description (not human readable).</td>
                 </tr>
             </table>
         </dd>
         <dt>output_data_table</dt>
    -    <dd>TEXT. The name of the table to store the output data. It has the following columns:
    +    <dd>TEXT. The name of the table generated by LDA that stores 
    +    the output data. It has the following columns:
             <table class="output">
                 <tr>
                     <th>docid</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Document id from input 'data_table'.</td>
                 </tr>
                 <tr>
                     <th>wordcount</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Count of number of words in the document, 
    +                including repeats. For example, if a word appears 3 times 
    +                in the document, it is counted 3 times.</td>
                 </tr>
                 <tr>
                     <th>words</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of \c wordid in the document, not
    +                including repeats.  For example, if a word appears 3 times 
    +                in the document, it appears only once in the \c words array.</td>
                 </tr>
                 <tr>
                     <th>counts</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Frequency of occurrence of a word in the document,
    +                indexed the same as the \c words array above.  For example, if the
    +                2nd element of the \c counts array is 4, it means that the word
    +                in the 2nd element of the \c words array occurs 4 times in the
    +                document.</td>
                 </tr>
                 <tr>
                     <th>topic_count</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of the count of words in the document
    +                that correspond to each topic.</td>
    --- End diff --
    
    Maybe mention that the array index runs from 0 to topic_num - 1.
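
Editorial sketch of how topic_count relates to topic_assignment under the 0-based indexing discussed here: position t of topic_count holds the number of words in the document assigned to topic id t, for t from 0 to topic_num - 1. This is plain Python for illustration, not the module's code.

```python
def topic_count(topic_assignment, topic_num):
    """Count words per topic; index t of the result is the count for topic id t."""
    counts = [0] * topic_num
    for topicid in topic_assignment:
        if not 0 <= topicid < topic_num:
            raise ValueError(f"topic id {topicid} outside 0..{topic_num - 1}")
        counts[topicid] += 1
    return counts
```

The entries of the result always sum to the length of topic_assignment, which the documentation gives as wordcount.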


---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by jingyimei <gi...@git.apache.org>.

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167709835
  
    --- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
    @@ -182,324 +105,789 @@ lda_train( data_table,
     \b Arguments
     <dl class="arglist">
         <dt>data_table</dt>
    -    <dd>TEXT. The name of the table storing the training dataset. Each row is
    +    <dd>TEXT. Name of the table storing the training dataset. Each row is
         in the form <tt>&lt;docid, wordid, count&gt;</tt> where \c docid, \c wordid, and \c count
    -    are non-negative integers.
    -
    +    are non-negative integers.  
         The \c docid column refers to the document ID, the \c wordid column is the
         word ID (the index of a word in the vocabulary), and \c count is the
    -    number of occurrences of the word in the document.
    -
    -    Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
    -    exact names in the data_table.</dd>
    +    number of occurrences of the word in the document. Please note:
    +    
    +    - \c wordid must be 
    +    contiguous integers going from 0 to \c voc_size &minus; \c 1.
    +    - column names for \c docid, \c wordid, and \c count are currently fixed, 
    +    so you must use these exact names in the data_table.  
    +    
    +    The function <a href="group__grp__text__utilities.html">Term Frequency</a>
    +    can be used to generate vocabulary in the required format from raw documents.
    +    </dd>
     
         <dt>model_table</dt>
    -    <dd>TEXT. The name of the table storing the learned models. This table has one row and the following columns.
    +    <dd>TEXT. This is an output table generated by LDA which contains the learned model. 
    +    It has one row with the following columns:
             <table class="output">
                 <tr>
                     <th>voc_size</th>
    -                <td>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</td>
    +                <td>INTEGER. Size of the vocabulary. As mentioned above for the input 
    +                table, \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.  
    +                </td>
                 </tr>
                 <tr>
                     <th>topic_num</th>
                     <td>INTEGER. Number of topics.</td>
                 </tr>
                 <tr>
                     <th>alpha</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-document 
    +                topic multinomial.</td>
                 </tr>
                 <tr>
                     <th>beta</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +                word multinomial.</td>
                 </tr>
                 <tr>
                     <th>model</th>
    -                <td>BIGINT[].</td>
    +                <td>BIGINT[]. The encoded model description (not human readable).</td>
                 </tr>
             </table>
         </dd>
         <dt>output_data_table</dt>
    -    <dd>TEXT. The name of the table to store the output data. It has the following columns:
    +    <dd>TEXT. The name of the table generated by LDA that stores 
    +    the output data. It has the following columns:
             <table class="output">
                 <tr>
                     <th>docid</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Document id from input 'data_table'.</td>
                 </tr>
                 <tr>
                     <th>wordcount</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Count of number of words in the document, 
    +                including repeats. For example, if a word appears 3 times 
    +                in the document, it is counted 3 times.</td>
                 </tr>
                 <tr>
                     <th>words</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of \c wordid in the document, not
    +                including repeats.  For example, if a word appears 3 times 
    +                in the document, it appears only once in the \c words array.</td>
                 </tr>
                 <tr>
                     <th>counts</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Frequency of occurrence of a word in the document,
    +                indexed the same as the \c words array above.  For example, if the
    +                2nd element of the \c counts array is 4, it means that the word
    +                in the 2nd element of the \c words array occurs 4 times in the
    +                document.</td>
                 </tr>
                 <tr>
                     <th>topic_count</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of the count of words in the document
    +                that correspond to each topic.</td>
                 </tr>
                 <tr>
                     <th>topic_assignment</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array indicating which topic each word in the 
    +                document corresponds to.  This array is of length \c  wordcount.</td>
                 </tr>
             </table>
         </dd>
         <dt>voc_size</dt>
    -    <dd>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</dd>
    +    <dd>INTEGER. Size of the vocabulary. As mentioned above for the 
    +                input 'data_table', \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.   
    +    </dd>
         <dt>topic_num</dt>
    -    <dd>INTEGER. Number of topics.</dd>
    +    <dd>INTEGER. Desired number of topics.</dd>
         <dt>iter_num</dt>
    -    <dd>INTEGER. Number of iterations (e.g. 60).</dd>
    +    <dd>INTEGER. Desired number of iterations.</dd>
         <dt>alpha</dt>
    -    <dd>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</dd>
    +    <dd>DOUBLE PRECISION. Dirichlet prior for the per-document topic 
    +    multinomial (e.g., 50/topic_num is a typical value to start with).</dd>
         <dt>beta</dt>
    -    <dd>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</dd>
    +    <dd>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +    word multinomial (e.g., 0.01 is a typical value to start with).</dd>
     </dl>
     
     @anchor predict
     @par Prediction Function
     
    -Prediction&mdash;labelling test documents using a learned LDA model&mdash;is accomplished with the following function:
    +Prediction involves labelling test documents using a learned LDA model:
     <pre class="syntax">
     lda_predict( data_table,
                  model_table,
    -             output_table
    +             output_predict_table
                );
     </pre>
    -
    -This function stores the prediction results in
    -<tt><em>output_table</em></tt>. Each row in the table stores the topic
    -distribution and the topic assignments for a document in the dataset. The
    -table has the following columns:
    -<table class="output">
    -    <tr>
    -        <th>docid</th>
    -        <td>INTEGER.</td>
    -    </tr>
    -    <tr>
    -        <th>wordcount</th>
    -        <td>INTEGER.</td>
    -    </tr>
    -    <tr>
    -        <th>words</th>
    -        <td>INTEGER[]. List of word IDs in this document.</td>
    -    </tr>
    -    <tr>
    -        <th>counts</th>
    -        <td>INTEGER[]. List of word counts in this document.</td>
    -    </tr>
    -    <tr>
    -        <th>topic_count</th>
    -        <td>INTEGER[]. Of length topic_num, list of topic counts in this document.</td>
    -    </tr>
    -    <tr>
    -        <th>topic_assignment</th>
    -        <td>INTEGER[]. Of length wordcount, list of topic index for each word.</td>
    -    </tr>
    -</table>
    +\b Arguments
    +<dl class="arglist">
    +<dt>data_table</dt>
    +    <dd>TEXT. Name of the table storing the test dataset 
    +    (new document to be labeled).
    +    </dd>
    +<dt>model_table</dt>
    +    <dd>TEXT. The model table generated by the training process.
    +    </dd>
    +<dt>output_predict_table</dt>
    +    <dd>TEXT. The prediction output table. 
    +    Each row in the table stores the topic 
    +    distribution and the topic assignments for a 
    +    document in the dataset. This table has the exact 
    +    same columns and interpretation as 
    +    the 'output_data_table' from the training function above. 
    +    </dd>
    +</dl>
     
     @anchor perplexity
    -@par Perplexity Function
    -This module provides a function for computing the perplexity.
    +@par Perplexity
    +Perplexity describes how well the model fits the data by 
    +computing word likelihoods averaged over the test documents.
    +This function returns a single perplexity value. 
     <pre class="syntax">
     lda_get_perplexity( model_table,
    -                    output_data_table
    +                    output_predict_table
                       );
     </pre>
    +\b Arguments
    +<dl class="arglist">
    +<dt>model_table</dt>
    +    <dd>TEXT. The model table generated by the training process.
    +    </dd>
    +<dt>output_predict_table</dt>
    +    <dd>TEXT. The prediction output table generated by the 
    +    predict function above.
    +    </dd>
    +</dl>
    +
    +@anchor helper
    +@par Helper Functions
    +
    +The helper functions can help to interpret the output 
    +from LDA training and LDA prediction.
    +
    +<b>Topic description by top-k words</b>
    --- End diff --
    
    Top-k words with the highest probability. I saw you mention it later in the example, and I feel we can also mention it here with three more words.
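
Editorial sketch of "top-k words with the highest probability" for a single topic. It assumes the probability of a word under a topic is simply its share of the words assigned to that topic, with no smoothing by the beta prior. The function name and the tie-breaking by wordid are assumptions for illustration, not the behaviour of lda_get_topic_desc.

```python
def top_k_words(word_counts, vocabulary, k):
    """Return the k (word, probability) pairs with the highest probability.

    word_counts[w] is how many times wordid w was assigned to the topic, and
    vocabulary[w] is the word for wordid w.
    """
    total = sum(word_counts)
    if total == 0:
        return []
    # Rank wordids by descending count, breaking ties by the smaller wordid.
    ranked = sorted(range(len(word_counts)), key=lambda w: (-word_counts[w], w))
    return [(vocabulary[w], word_counts[w] / total) for w in ranked[:k]]
```

Slicing with ranked[:k] returns exactly k entries when the vocabulary has at least k words, which is the behaviour the fourth commit restores.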


---

[GitHub] madlib issue #232: Multiple LDA improvements and fixes

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/232
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/348/



---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by jingyimei <gi...@git.apache.org>.

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167720593
  
    --- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
    @@ -182,324 +105,789 @@ lda_train( data_table,
     \b Arguments
     <dl class="arglist">
         <dt>data_table</dt>
    -    <dd>TEXT. The name of the table storing the training dataset. Each row is
    +    <dd>TEXT. Name of the table storing the training dataset. Each row is
         in the form <tt>&lt;docid, wordid, count&gt;</tt> where \c docid, \c wordid, and \c count
    -    are non-negative integers.
    -
    +    are non-negative integers.  
         The \c docid column refers to the document ID, the \c wordid column is the
         word ID (the index of a word in the vocabulary), and \c count is the
    -    number of occurrences of the word in the document.
    -
    -    Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
    -    exact names in the data_table.</dd>
    +    number of occurrences of the word in the document. Please note:
    +    
    +    - \c wordid must be 
    +    contiguous integers going from 0 to \c voc_size &minus; \c 1.
    +    - column names for \c docid, \c wordid, and \c count are currently fixed, 
    +    so you must use these exact names in the data_table.  
    +    
    +    The function <a href="group__grp__text__utilities.html">Term Frequency</a>
    +    can be used to generate vocabulary in the required format from raw documents.
    +    </dd>
     
         <dt>model_table</dt>
    -    <dd>TEXT. The name of the table storing the learned models. This table has one row and the following columns.
    +    <dd>TEXT. This is an output table generated by LDA which contains the learned model. 
    +    It has one row with the following columns:
             <table class="output">
                 <tr>
                     <th>voc_size</th>
    -                <td>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</td>
    +                <td>INTEGER. Size of the vocabulary. As mentioned above for the input 
    +                table, \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.  
    +                </td>
                 </tr>
                 <tr>
                     <th>topic_num</th>
                     <td>INTEGER. Number of topics.</td>
                 </tr>
                 <tr>
                     <th>alpha</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-document 
    +                topic multinomial.</td>
                 </tr>
                 <tr>
                     <th>beta</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +                word multinomial.</td>
                 </tr>
                 <tr>
                     <th>model</th>
    -                <td>BIGINT[].</td>
    +                <td>BIGINT[]. The encoded model description (not human readable).</td>
                 </tr>
             </table>
         </dd>
         <dt>output_data_table</dt>
    -    <dd>TEXT. The name of the table to store the output data. It has the following columns:
    +    <dd>TEXT. The name of the table generated by LDA that stores 
    +    the output data. It has the following columns:
             <table class="output">
                 <tr>
                     <th>docid</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Document id from input 'data_table'.</td>
                 </tr>
                 <tr>
                     <th>wordcount</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Count of number of words in the document, 
    +                including repeats. For example, if a word appears 3 times 
    +                in the document, it is counted 3 times.</td>
                 </tr>
                 <tr>
                     <th>words</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of \c wordid in the document, not
    +                including repeats.  For example, if a word appears 3 times 
    +                in the document, it appears only once in the \c words array.</td>
                 </tr>
                 <tr>
                     <th>counts</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Frequency of occurrence of a word in the document,
    +                indexed the same as the \c words array above.  For example, if the
    +                2nd element of the \c counts array is 4, it means that the word
    +                in the 2nd element of the \c words array occurs 4 times in the
    +                document.</td>
                 </tr>
                 <tr>
                     <th>topic_count</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of the count of words in the document
    +                that correspond to each topic.</td>
                 </tr>
                 <tr>
                     <th>topic_assignment</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array indicating which topic each word in the 
    +                document corresponds to.  This array is of length \c  wordcount.</td>
                 </tr>
             </table>
         </dd>
         <dt>voc_size</dt>
    -    <dd>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</dd>
    +    <dd>INTEGER. Size of the vocabulary. As mentioned above for the 
    +                input 'data_table', \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.   
    +    </dd>
         <dt>topic_num</dt>
    -    <dd>INTEGER. Number of topics.</dd>
    +    <dd>INTEGER. Desired number of topics.</dd>
         <dt>iter_num</dt>
    -    <dd>INTEGER. Number of iterations (e.g. 60).</dd>
    +    <dd>INTEGER. Desired number of iterations.</dd>
         <dt>alpha</dt>
    -    <dd>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</dd>
    +    <dd>DOUBLE PRECISION. Dirichlet prior for the per-document topic 
    +    multinomial (e.g., 50/topic_num is a typical value to start with).</dd>
    --- End diff --
    
    I found that different libraries use different starting values, e.g. 1/k, 5/k and 0.1. We can mention that this value (50/k) is the one suggested in the Griffiths and Steyvers paper.
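
To make the comparison concrete, here is a minimal Python sketch that tabulates the starting values mentioned in this comment for a given number of topics. The function name `alpha_candidates` and the dictionary keys are hypothetical and are not part of MADlib; the values are simply the heuristics listed above (1/k, 5/k, 50/k and a fixed 0.1).

```python
def alpha_candidates(topic_num):
    """Return commonly cited starting values for alpha, keyed by heuristic.

    topic_num is the number of topics (k). The helper itself is
    illustrative only and does not exist in MADlib.
    """
    if topic_num <= 0:
        raise ValueError("topic_num must be a positive integer")
    k = float(topic_num)
    return {
        "1/k": 1.0 / k,
        "5/k": 5.0 / k,
        "50/k": 50.0 / k,   # the value quoted in the lda_train docs
        "fixed": 0.1,
    }


candidates = alpha_candidates(5)
```

With 5 topics the 50/k heuristic gives an alpha of 10, which is two orders of magnitude larger than the fixed 0.1 default, so it may be worth stating in the docs that this is only one starting point.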



---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by jingyimei <gi...@git.apache.org>.

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167708544
  
    --- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
    @@ -182,324 +105,789 @@ lda_train( data_table,
     \b Arguments
     <dl class="arglist">
         <dt>data_table</dt>
    -    <dd>TEXT. The name of the table storing the training dataset. Each row is
    +    <dd>TEXT. Name of the table storing the training dataset. Each row is
         in the form <tt>&lt;docid, wordid, count&gt;</tt> where \c docid, \c wordid, and \c count
    -    are non-negative integers.
    -
    +    are non-negative integers.  
         The \c docid column refers to the document ID, the \c wordid column is the
         word ID (the index of a word in the vocabulary), and \c count is the
    -    number of occurrences of the word in the document.
    -
    -    Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
    -    exact names in the data_table.</dd>
    +    number of occurrences of the word in the document. Please note:
    +    
    +    - \c wordid must be 
    +    contiguous integers going from 0 to \c voc_size &minus; \c 1.
    +    - column names for \c docid, \c wordid, and \c count are currently fixed, 
    +    so you must use these exact names in the data_table.  
    +    
    +    The function <a href="group__grp__text__utilities.html">Term Frequency</a>
    +    can be used to generate vocabulary in the required format from raw documents.
    +    </dd>
     
         <dt>model_table</dt>
    -    <dd>TEXT. The name of the table storing the learned models. This table has one row and the following columns.
    +    <dd>TEXT. This is an output table generated by LDA which contains the learned model. 
    +    It has one row with the following columns:
             <table class="output">
                 <tr>
                     <th>voc_size</th>
    -                <td>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</td>
    +                <td>INTEGER. Size of the vocabulary. As mentioned above for the input 
    +                table, \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.  
    +                </td>
                 </tr>
                 <tr>
                     <th>topic_num</th>
                     <td>INTEGER. Number of topics.</td>
                 </tr>
                 <tr>
                     <th>alpha</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-document 
    +                topic multinomial.</td>
                 </tr>
                 <tr>
                     <th>beta</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +                word multinomial.</td>
                 </tr>
                 <tr>
                     <th>model</th>
    -                <td>BIGINT[].</td>
    +                <td>BIGINT[]. The encoded model description (not human readable).</td>
                 </tr>
             </table>
         </dd>
         <dt>output_data_table</dt>
    -    <dd>TEXT. The name of the table to store the output data. It has the following columns:
    +    <dd>TEXT. The name of the table generated by LDA that stores 
    +    the output data. It has the following columns:
             <table class="output">
                 <tr>
                     <th>docid</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Document id from input 'data_table'.</td>
                 </tr>
                 <tr>
                     <th>wordcount</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Count of number of words in the document, 
    +                including repeats. For example, if a word appears 3 times 
    +                in the document, it is counted 3 times.</td>
                 </tr>
                 <tr>
                     <th>words</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of \c wordid in the document, not
    +                including repeats.  For example, if a word appears 3 times 
    +                in the document, it appears only once in the \c words array.</td>
                 </tr>
                 <tr>
                     <th>counts</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Frequency of occurrence of a word in the document,
    +                indexed the same as the \c words array above.  For example, if the
    +                2nd element of the \c counts array is 4, it means that the word
    +                in the 2nd element of the \c words array occurs 4 times in the
    +                document.</td>
                 </tr>
                 <tr>
                     <th>topic_count</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of the count of words in the document
    +                that correspond to each topic.</td>
                 </tr>
                 <tr>
                     <th>topic_assignment</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array indicating which topic each word in the 
    +                document corresponds to.  This array is of length \c  wordcount.</td>
                 </tr>
             </table>
         </dd>
         <dt>voc_size</dt>
    -    <dd>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</dd>
    +    <dd>INTEGER. Size of the vocabulary. As mentioned above for the 
    +                input 'data_table', \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.   
    +    </dd>
         <dt>topic_num</dt>
    -    <dd>INTEGER. Number of topics.</dd>
    +    <dd>INTEGER. Desired number of topics.</dd>
         <dt>iter_num</dt>
    -    <dd>INTEGER. Number of iterations (e.g. 60).</dd>
    +    <dd>INTEGER. Desired number of iterations.</dd>
         <dt>alpha</dt>
    -    <dd>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</dd>
    +    <dd>DOUBLE PRECISION. Dirichlet prior for the per-document topic 
    +    multinomial (e.g., 50/topic_num is a typical value to start with).</dd>
         <dt>beta</dt>
    -    <dd>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</dd>
    +    <dd>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +    word multinomial (e.g., 0.01 is a typical value to start with).</dd>
     </dl>
     
     @anchor predict
     @par Prediction Function
     
    -Prediction&mdash;labelling test documents using a learned LDA model&mdash;is accomplished with the following function:
    +Prediction involves labelling test documents using a learned LDA model:
     <pre class="syntax">
     lda_predict( data_table,
                  model_table,
    -             output_table
    +             output_predict_table
                );
     </pre>
    -
    -This function stores the prediction results in
    -<tt><em>output_table</em></tt>. Each row in the table stores the topic
    -distribution and the topic assignments for a document in the dataset. The
    -table has the following columns:
    -<table class="output">
    -    <tr>
    -        <th>docid</th>
    -        <td>INTEGER.</td>
    -    </tr>
    -    <tr>
    -        <th>wordcount</th>
    -        <td>INTEGER.</td>
    -    </tr>
    -    <tr>
    -        <th>words</th>
    -        <td>INTEGER[]. List of word IDs in this document.</td>
    -    </tr>
    -    <tr>
    -        <th>counts</th>
    -        <td>INTEGER[]. List of word counts in this document.</td>
    -    </tr>
    -    <tr>
    -        <th>topic_count</th>
    -        <td>INTEGER[]. Of length topic_num, list of topic counts in this document.</td>
    -    </tr>
    -    <tr>
    -        <th>topic_assignment</th>
    -        <td>INTEGER[]. Of length wordcount, list of topic index for each word.</td>
    -    </tr>
    -</table>
    +\b Arguments
    +<dl class="arglist">
    +<dt>data_table</dt>
    +    <dd>TEXT. Name of the table storing the test dataset 
    +    (new document to be labeled).
    +    </dd>
    +<dt>model_table</dt>
    +    <dd>TEXT. The model table generated by the training process.
    +    </dd>
    +<dt>output_predict_table</dt>
    +    <dd>TEXT. The prediction output table. 
    +    Each row in the table stores the topic 
    +    distribution and the topic assignments for a 
    +    document in the dataset. This table has the exact 
    +    same columns and interpretation as 
    +    the 'output_data_table' from the training function above. 
    +    </dd>
    +</dl>
     
     @anchor perplexity
    -@par Perplexity Function
    -This module provides a function for computing the perplexity.
    +@par Perplexity
    +Perplexity describes how well the model fits the data by 
    +computing word likelihoods averaged over the test documents.
    +This function returns a single perplexity value. 
     <pre class="syntax">
     lda_get_perplexity( model_table,
    -                    output_data_table
    +                    output_predict_table
                       );
     </pre>
    +\b Arguments
    +<dl class="arglist">
    +<dt>model_table</dt>
    +    <dd>TEXT. The model table generated by the training process.
    +    </dd>
    +<dt>output_predict_table</dt>
    +    <dd>TEXT. The prediction output table generated by the 
    +    predict function above.
    +    </dd>
    +</dl>
    +
    +@anchor helper
    +@par Helper Functions
    +
    +The helper functions can help to interpret the output 
    +from LDA training and LDA prediction.
    +
    +<b>Topic description by top-k words</b>
    +
    +Applies to LDA training only.
    +
    +<pre class="syntax">
    +lda_get_topic_desc( model_table,
    +                    vocab_table,
    +                    output_table,
    +                    top_k
    +                  )
    +</pre>
    +\b Arguments
    +<dl class="arglist">
    +<dt>model_table</dt>
    +    <dd>TEXT. The model table generated by the training process.
    +    </dd>
    +<dt>vocab_table</dt>
    +    <dd>TEXT. The vocabulary table in the form <wordid, word>. 
    --- End diff --
    
    We can mention that this table can be generated by term_frequency.
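
As an illustration of the two table shapes involved, the Python sketch below builds a `<wordid, word>` vocabulary and the `<docid, wordid, count>` triples from tokenized documents. The helper names are hypothetical, and assigning ids in sorted word order is an assumption made for the example; term_frequency may assign ids differently. The only property relied on is the one the docs require: `wordid` is contiguous from 0 to `voc_size` - 1.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct word to a contiguous id starting at 0.

    documents: dict of docid -> list of tokens.
    Sorted order is an assumption for this sketch only.
    """
    words = sorted({w for tokens in documents.values() for w in tokens})
    return {word: wordid for wordid, word in enumerate(words)}


def to_triples(documents, vocab):
    """Return (docid, wordid, count) rows, the shape lda_train expects."""
    triples = []
    for docid in sorted(documents):
        for word, count in sorted(Counter(documents[docid]).items()):
            triples.append((docid, vocab[word], count))
    return triples


docs = {
    0: ["kittens", "are", "cute"],
    1: ["cute", "cute", "hamster"],
}
vocab = build_vocabulary(docs)
triples = to_triples(docs, vocab)
voc_size = len(vocab)
```

Here `voc_size` is 4 and every `wordid` in `triples` falls in the range 0 to 3, which is the constraint the `data_table` description states.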


---

[GitHub] madlib issue #232: Multiple LDA improvements and fixes

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/232
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/340/



---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by jingyimei <gi...@git.apache.org>.

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167715361
  
    --- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in ---
    @@ -74,175 +81,231 @@ tasks related to text.
         Flag to indicate if a vocabulary is to be created. If TRUE, an additional
         output table is created containing the vocabulary of all words, with an id
         assigned to each word. The table is called <em>output_table</em>_vocabulary
    -    (suffix added to the <em>output_table</em> name) and contains the
    +    (i.e., suffix added to the <em>output_table</em> name) and contains the
         following columns:
    -        - \c wordid: An id assignment for each word
    -        - \c word: The word/term
    +        - \c wordid: An id for each word.
    +        - \c word: The word/term corresponding to the id.
         </dd>
     </dl>
     
     @anchor examples
     @par Examples
     
    --# Prepare datasets with some example documents
    +-# First we create a document table with one document per row:
     <pre class="example">
     DROP TABLE IF EXISTS documents;
    -CREATE TABLE documents(docid INTEGER, doc_contents TEXT);
    +CREATE TABLE documents(docid INT4, contents TEXT);
     INSERT INTO documents VALUES
    -(1, 'I like to eat broccoli and banana. I ate a banana and spinach smoothie for breakfast.'),
    -(2, 'Chinchillas and kittens are cute.'),
    -(3, 'My sister adopted two kittens yesterday'),
    -(4, 'Look at this cute hamster munching on a piece of broccoli');
    +(0, 'I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.'),
    +(1, 'Chinchillas and kittens are cute.'),
    +(2, 'My sister adopted two kittens yesterday.'),
    +(3, 'Look at this cute hamster munching on a piece of broccoli.');
     </pre>
    -
    --# Add a new column containing the words (lower-cased) in a text array
    +You can apply stemming, stop word removal and tokenization at this point 
    +in order to prepare the documents for text processing. 
    +Depending upon your database version, various tools are 
    +available. Databases based on more recent versions of 
    +PostgreSQL may do something like:
    +<pre class="example">
    +SELECT tsvector_to_array(to_tsvector('english',contents)) from documents;
    +</pre>
    +<pre class="result">
    +                    tsvector_to_array                     
    ++----------------------------------------------------------
    + {ate,banana,breakfast,broccoli,eat,like,smoothi,spinach}
    + {chinchilla,cute,kitten}
    + {adopt,kitten,sister,two,yesterday}
    + {broccoli,cute,hamster,look,munch,piec}
    +(4 rows)
    +</pre>
    +In this example, we assume a database based on an older 
    +version of PostgreSQL and just perform basic punctuation 
    +removal and tokenization. The array of words is added as 
    +a new column to the documents table:
     <pre class="example">
     ALTER TABLE documents ADD COLUMN words TEXT[];
    -UPDATE documents SET words = regexp_split_to_array(lower(doc_contents), E'[\\\\s+\\\\.]');
    +UPDATE documents SET words = 
    +    regexp_split_to_array(lower(
    +    regexp_replace(contents, E'[,.;\\']','', 'g')
    +    ), E'[\\\\s+]');
    +\\x on   
    +SELECT * FROM documents ORDER BY docid;
    +</pre>
    +<pre class="result">
    +-[ RECORD 1 ]------------------------------------------------------------------------------------
    +docid    | 0
    +contents | I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.
    +words    | {i,like,to,eat,broccoli,and,bananas,i,ate,a,banana,and,spinach,smoothie,for,breakfast}
    +-[ RECORD 2 ]------------------------------------------------------------------------------------
    +docid    | 1
    +contents | Chinchillas and kittens are cute.
    +words    | {chinchillas,and,kittens,are,cute}
    +-[ RECORD 3 ]------------------------------------------------------------------------------------
    +docid    | 2
    +contents | My sister adopted two kittens yesterday.
    +words    | {my,sister,adopted,two,kittens,yesterday}
    +-[ RECORD 4 ]------------------------------------------------------------------------------------
    +docid    | 3
    +contents | Look at this cute hamster munching on a piece of broccoli.
    +words    | {look,at,this,cute,hamster,munching,on,a,piece,of,broccoli}
     </pre>
     
    --# Compute the frequency of each word in each document
    +-# Compute the frequency of each word in each document:
     <pre class="example">
    -DROP TABLE IF EXISTS documents_tf;
    -SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf');
    -SELECT * FROM documents_tf order by docid;
    +DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
    +SELECT madlib.term_frequency('documents',    -- input table
    +                             'docid',        -- document id
    +                             'words',        -- vector of words in document
    +                             'documents_tf'  -- output table
    +                            );
    +\\x off
    +SELECT * FROM documents_tf ORDER BY docid;
     </pre>
     <pre class="result">
    - docid |    word    | count
    --------+------------+-------
    -     1 | ate        |     1
    -     1 | like       |     1
    -     1 | breakfast  |     1
    -     1 | to         |     1
    -     1 | broccoli   |     1
    -     1 | spinach    |     1
    -     1 | i          |     2
    -     1 | and        |     2
    -     1 | a          |     1
    -     1 |            |     2
    -     1 | smoothie   |     1
    -     1 | eat        |     1
    -     1 | banana     |     2
    -     1 | for        |     1
    -     2 | cute       |     1
    -     2 | are        |     1
    -     2 | kitten     |     1
    -     2 | and        |     1
    -     2 | chinchilla |     1
    -     3 | kitten     |     1
    -     3 | my         |     1
    -     3 | a          |     1
    -     3 | sister     |     1
    -     3 | adopted    |     1
    -     3 | yesterday  |     1
    -     4 | at         |     1
    -     4 | of         |     1
    -     4 | piece      |     1
    -     4 | this       |     1
    -     4 | a          |     1
    -     4 | broccoli   |     1
    -     4 | hamster    |     1
    -     4 | munching   |     1
    -     4 | cute       |     1
    -     4 | look       |     1
    -(35 rows)
    + docid |    word     | count 
    +-------+-------------+-------
    +     0 | a           |     1
    +     0 | breakfast   |     1
    +     0 | banana      |     1
    +     0 | and         |     2
    +     0 | eat         |     1
    +     0 | smoothie    |     1
    +     0 | to          |     1
    +     0 | like        |     1
    +     0 | broccoli    |     1
    +     0 | bananas     |     1
    +     0 | spinach     |     1
    +     0 | i           |     2
    +     0 | ate         |     1
    +     0 | for         |     1
    +     1 | are         |     1
    +     1 | cute        |     1
    +     1 | kittens     |     1
    +     1 | chinchillas |     1
    +     1 | and         |     1
    +     2 | two         |     1
    +     2 | yesterday   |     1
    +     2 | kittens     |     1
    +     2 | sister      |     1
    +     2 | my          |     1
    +     2 | adopted     |     1
    +     3 | this        |     1
    +     3 | at          |     1
    +     3 | a           |     1
    +     3 | broccoli    |     1
    +     3 | of          |     1
    +     3 | look        |     1
    +     3 | hamster     |     1
    +     3 | on          |     1
    +     3 | piece       |     1
    +     3 | cute        |     1
    +     3 | munching    |     1
    +(36 rows)
     </pre>
     
    --# We also can create a vocabulary of the words and store a wordid in the output
    -table instead of the actual word.
    +-# Next we create a vocabulary of the words 
    +and store a wordid in the output table instead of the 
    +actual word:
     <pre class="example">
    -DROP TABLE IF EXISTS documents_tf;
    -DROP TABLE IF EXISTS documents_tf_vocabulary;
    -SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf', TRUE);
    --- Output with wordid instead of the actual words
    -SELECT * FROM documents_tf order by docid;
    +DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
    +SELECT madlib.term_frequency('documents',    -- input table
    +                             'docid',        -- document id
    --- End diff --
    
    document id column
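
For readers who want to sanity-check the example output in the diff above, the following Python sketch mirrors the basic preprocessing (strip `, . ; '`, lower-case, split on whitespace) and counts words per document. It is an approximation written for this note, not MADlib code: the function names are hypothetical, and it splits on runs of whitespace, which gives the same tokens as the SQL example for these single-spaced sentences.

```python
import re
from collections import Counter

DOCUMENTS = {
    0: "I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.",
    1: "Chinchillas and kittens are cute.",
    2: "My sister adopted two kittens yesterday.",
    3: "Look at this cute hamster munching on a piece of broccoli.",
}


def tokenize(contents):
    """Remove basic punctuation, lower-case, and split on whitespace."""
    cleaned = re.sub(r"[,.;']", "", contents).lower()
    return re.split(r"\s+", cleaned)


def term_frequency(documents):
    """Return (docid, word, count) rows, one per distinct word per document."""
    rows = []
    for docid in sorted(documents):
        counts = Counter(tokenize(documents[docid]))
        for word in sorted(counts):
            rows.append((docid, word, counts[word]))
    return rows


rows = term_frequency(DOCUMENTS)
```

The sketch yields 36 rows, matching the row count in the new example output, with `and` and `i` each counted twice in document 0.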


---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by fmcquillan99 <gi...@git.apache.org>.

Github user fmcquillan99 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r168248554
  
    --- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
    @@ -182,324 +105,789 @@ lda_train( data_table,
     \b Arguments
     <dl class="arglist">
         <dt>data_table</dt>
    -    <dd>TEXT. The name of the table storing the training dataset. Each row is
    +    <dd>TEXT. Name of the table storing the training dataset. Each row is
         in the form <tt>&lt;docid, wordid, count&gt;</tt> where \c docid, \c wordid, and \c count
    -    are non-negative integers.
    -
    +    are non-negative integers.  
         The \c docid column refers to the document ID, the \c wordid column is the
         word ID (the index of a word in the vocabulary), and \c count is the
    -    number of occurrences of the word in the document.
    -
    -    Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
    -    exact names in the data_table.</dd>
    +    number of occurrences of the word in the document. Please note:
    +    
    +    - \c wordid must be 
    +    contiguous integers going from 0 to \c voc_size &minus; \c 1.
    +    - column names for \c docid, \c wordid, and \c count are currently fixed, 
    +    so you must use these exact names in the data_table.  
    +    
    +    The function <a href="group__grp__text__utilities.html">Term Frequency</a>
    +    can be used to generate vocabulary in the required format from raw documents.
    +    </dd>
     
         <dt>model_table</dt>
    -    <dd>TEXT. The name of the table storing the learned models. This table has one row and the following columns.
    +    <dd>TEXT. This is an output table generated by LDA which contains the learned model. 
    +    It has one row with the following columns:
             <table class="output">
                 <tr>
                     <th>voc_size</th>
    -                <td>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</td>
    +                <td>INTEGER. Size of the vocabulary. As mentioned above for the input 
    +                table, \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.  
    +                </td>
                 </tr>
                 <tr>
                     <th>topic_num</th>
                     <td>INTEGER. Number of topics.</td>
                 </tr>
                 <tr>
                     <th>alpha</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-document 
    +                topic multinomial.</td>
                 </tr>
                 <tr>
                     <th>beta</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +                word multinomial.</td>
                 </tr>
                 <tr>
                     <th>model</th>
    -                <td>BIGINT[].</td>
    +                <td>BIGINT[]. The encoded model description (not human readable).</td>
                 </tr>
             </table>
         </dd>
         <dt>output_data_table</dt>
    -    <dd>TEXT. The name of the table to store the output data. It has the following columns:
    +    <dd>TEXT. The name of the table generated by LDA that stores 
    +    the output data. It has the following columns:
             <table class="output">
                 <tr>
                     <th>docid</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Document id from input 'data_table'.</td>
                 </tr>
                 <tr>
                     <th>wordcount</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Count of number of words in the document, 
    +                including repeats. For example, if a word appears 3 times 
    +                in the document, it is counted 3 times.</td>
                 </tr>
                 <tr>
                     <th>words</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of \c wordid in the document, not
    +                including repeats.  For example, if a word appears 3 times 
    +                in the document, it appears only once in the \c words array.</td>
                 </tr>
                 <tr>
                     <th>counts</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Frequency of occurrence of a word in the document,
    +                indexed the same as the \c words array above.  For example, if the
    +                2nd element of the \c counts array is 4, it means that the word
    +                in the 2nd element of the \c words array occurs 4 times in the
    +                document.</td>
                 </tr>
                 <tr>
                     <th>topic_count</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of the count of words in the document
    +                that correspond to each topic.</td>
    --- End diff --
    
    Array indexing is a developer thing not a user thing, but I will add:
    
    "This array is of length \c topic_num."
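
The relationship between the output columns can be sketched in a few lines of Python. This is only an illustration of the documented lengths (`topic_assignment` has length `wordcount`, `topic_count` has length `topic_num`); the function name is hypothetical and the sketch assumes topic ids are 0-based, which should be checked against the actual output since topic id indexing is one of the issues this PR touches.

```python
def summarize_document(counts, topic_assignment, topic_num):
    """Derive wordcount and topic_count for one document.

    counts: per-word frequencies, aligned with the words array.
    topic_assignment: one topic id per word occurrence (assumed 0-based).
    """
    wordcount = sum(counts)
    if len(topic_assignment) != wordcount:
        raise ValueError("topic_assignment must have length wordcount")
    topic_count = [0] * topic_num
    for topic in topic_assignment:
        topic_count[topic] += 1
    return wordcount, topic_count


# Three distinct words occurring 2, 1 and 1 times, with 3 topics.
wordcount, topic_count = summarize_document(
    counts=[2, 1, 1],
    topic_assignment=[0, 0, 1, 2],
    topic_num=3,
)
```

In this example `wordcount` is 4 and `topic_count` is `[2, 1, 1]`, so the entries of `topic_count` always sum to `wordcount`.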


---

[GitHub] madlib issue #232: Multiple LDA improvements and fixes

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/232
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/341/



---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/232


---

[GitHub] madlib issue #232: Multiple LDA improvements and fixes

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/232
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/339/



---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r168526426
  
    --- Diff: src/ports/postgres/modules/lda/lda.py_in ---
    @@ -120,14 +120,22 @@ class LDATrainer:
             # etime = time.time()
             # plpy.notice('\t\ttime elapsed: %.2f seconds' % (etime - stime))
     
    -    def gen_output_data_table(self):
    +    def gen_final_data_tables(self):
             # stime = time.time()
             # plpy.notice('\t\tgenerating output data table ...')
     
    -        work_table_final = self.work_table_1
    -        if self.iter_num % 2 == 0:
    -            work_table_final = self.work_table_0
    +        ##This function updates 2 tables, one is the model table and
    +        # the other one is the output table
     
    +        work_table_final = self.work_table_0 if self.iter_num % 2 == 0 \
    +        else self.work_table_1
    +
    +        # Update model table
    +        #JIRA:MADLIB-1201, we have to update model table one more time after
    --- End diff --
    
    I suggest not adding the JIRA number here, because it doesn't add more context. 
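
Separately from the comment wording, the refactored selection of `work_table_final` in this hunk appears to be behavior-preserving. The sketch below compares the two forms side by side; the function names and the placeholder table names are hypothetical stand-ins for `self.work_table_0` and `self.work_table_1`.

```python
def final_work_table_old(iter_num, work_table_0, work_table_1):
    """Selection as written before the change (default, then override)."""
    work_table_final = work_table_1
    if iter_num % 2 == 0:
        work_table_final = work_table_0
    return work_table_final


def final_work_table_new(iter_num, work_table_0, work_table_1):
    """Selection as written after the change (conditional expression)."""
    return work_table_0 if iter_num % 2 == 0 else work_table_1


# Both forms pick table 0 for even iteration counts and table 1 for odd ones.
results = [
    (final_work_table_old(n, "wt0", "wt1"), final_work_table_new(n, "wt0", "wt1"))
    for n in range(1, 11)
]
```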


---

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

Posted by jingyimei <gi...@git.apache.org>.

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167708360
  
    --- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
    @@ -182,324 +105,789 @@ lda_train( data_table,
     \b Arguments
     <dl class="arglist">
         <dt>data_table</dt>
    -    <dd>TEXT. The name of the table storing the training dataset. Each row is
    +    <dd>TEXT. Name of the table storing the training dataset. Each row is
         in the form <tt>&lt;docid, wordid, count&gt;</tt> where \c docid, \c wordid, and \c count
    -    are non-negative integers.
    -
    +    are non-negative integers.  
         The \c docid column refers to the document ID, the \c wordid column is the
         word ID (the index of a word in the vocabulary), and \c count is the
    -    number of occurrences of the word in the document.
    -
    -    Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
    -    exact names in the data_table.</dd>
    +    number of occurrences of the word in the document. Please note:
    +    
    +    - \c wordid must be 
    +    contiguous integers going from 0 to \c voc_size &minus; \c 1.
    +    - column names for \c docid, \c wordid, and \c count are currently fixed, 
    +    so you must use these exact names in the data_table.  
    +    
    +    The function <a href="group__grp__text__utilities.html">Term Frequency</a>
    +    can be used to generate vocabulary in the required format from raw documents.
    +    </dd>
     
         <dt>model_table</dt>
    -    <dd>TEXT. The name of the table storing the learned models. This table has one row and the following columns.
    +    <dd>TEXT. This is an output table generated by LDA which contains the learned model. 
    +    It has one row with the following columns:
             <table class="output">
                 <tr>
                     <th>voc_size</th>
    -                <td>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.</td>
    +                <td>INTEGER. Size of the vocabulary. As mentioned above for the input 
    +                table, \c wordid consists of contiguous integers going 
    +                from 0 to \c voc_size &minus; \c 1.  
    +                </td>
                 </tr>
                 <tr>
                     <th>topic_num</th>
                     <td>INTEGER. Number of topics.</td>
                 </tr>
                 <tr>
                     <th>alpha</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-document 
    +                topic multinomial.</td>
                 </tr>
                 <tr>
                     <th>beta</th>
    -                <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).</td>
    +                <td>DOUBLE PRECISION. Dirichlet prior for the per-topic 
    +                word multinomial.</td>
                 </tr>
                 <tr>
                     <th>model</th>
    -                <td>BIGINT[].</td>
    +                <td>BIGINT[]. The encoded model description (not human readable).</td>
                 </tr>
             </table>
         </dd>
         <dt>output_data_table</dt>
    -    <dd>TEXT. The name of the table to store the output data. It has the following columns:
    +    <dd>TEXT. The name of the table generated by LDA that stores 
    +    the output data. It has the following columns:
             <table class="output">
                 <tr>
                     <th>docid</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Document id from input 'data_table'.</td>
                 </tr>
                 <tr>
                     <th>wordcount</th>
    -                <td>INTEGER.</td>
    +                <td>INTEGER. Count of number of words in the document, 
    +                including repeats. For example, if a word appears 3 times 
    +                in the document, it is counted 3 times.</td>
                 </tr>
                 <tr>
                     <th>words</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of \c wordid in the document, not
    +                including repeats.  For example, if a word appears 3 times 
    +                in the document, it appears only once in the \c words array.</td>
                 </tr>
                 <tr>
                     <th>counts</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Frequency of occurance of a word in the document,
    +                indexed the same as the \c words array above.  For example, if the
    +                2nd element of the \c counts array is 4, it means that the word
    +                in the 2nd element of the \c words array occurs 4 times in the
    +                document.</td>
                 </tr>
                 <tr>
                     <th>topic_count</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array of the count of words in the document
    +                that correspond to each topic.</td>
                 </tr>
                 <tr>
                     <th>topic_assignment</th>
    -                <td>INTEGER[].</td>
    +                <td>INTEGER[]. Array indicating which topic each word in the 
    +                document corresponds to.  This array is of length \c  wordcount.</td>
    --- End diff --
    
    We could mention that a word repeated N times appears N times consecutively in the array.
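
    To illustrate the point, here is a minimal sketch in plain Python (not MADlib code) of how the documented columns relate for one document. The sample values and the helper names expand_words and topic_count are made up for illustration, and 0-based topic ids are an assumption.

```python
def expand_words(words, counts):
    """Repeat each wordid by its count, so repeats sit next to each other."""
    tokens = []
    for wordid, count in zip(words, counts):
        tokens.extend([wordid] * count)
    return tokens


def topic_count(topic_assignment, topic_num):
    """Count how many word occurrences are assigned to each topic."""
    totals = [0] * topic_num
    for topic in topic_assignment:
        totals[topic] += 1
    return totals


# Hypothetical row of output_data_table for one document.
words = [0, 3, 5]        # distinct wordids, no repeats
counts = [2, 1, 3]       # occurrences, indexed the same as `words`
wordcount = sum(counts)  # total words in the document, including repeats

# One topic per word occurrence; values are invented, topic ids assumed 0-based.
topic_assignment = [1, 1, 0, 2, 2, 1]

# Per-occurrence wordids that line up position by position with topic_assignment.
tokens = expand_words(words, counts)
```

    Under these assumptions the first two entries of topic_assignment belong to wordid 0, the third to wordid 3, and the last three to wordid 5, and the array length equals wordcount.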


---