Posted to user@madlib.apache.org by Markus Paaso <ma...@gmail.com> on 2017/08/11 12:23:00 UTC

LDA output format

Hi,

I am having some problems reading the LDA output.


Please see this row of madlib.lda_train output:

docid            | 6
wordcount        | 33
words            |
{7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,30422,94672,62107,94673,62080,101046,
4379,26503,61105,19193,28929}
counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
topic_assignment |
{2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}


It's hard to tell which topic ids are assigned to which word ids when the
*words* array has a different length than the *topic_assignment* array.
It would be nice if the *words* array were the same length as the
*topic_assignment* array.

1. What kind of SQL query would give a result with wordid - topicid pairs?
I tried to match them by hand but failed for wordid 28934. I wonder if a
repeating wordid can have different topic assignments in the same document?

wordid | topicid
----------------
7386   | 2
42021  | 16
7705   | 11
105334 | 15
18083  | 2
89364  | 2
31073  | 2
28934  | 2 OR 15 ?
56286  | 15
61921  | 2
59142  | 16
33364  | 2
79035  | 16
37792  | 10
91823  | 10
30422  | 2
94672  | 16
62107  | 2
94673  | 1
62080  | 15
101046 | 16
4379   | 7
26503  | 11
61105  | 2
19193  | 2
28929  | 2
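
For reference, here is a sketch of one way to reconstruct the wordid - topicid
pairs directly in SQL. It is only a sketch: it assumes PostgreSQL 9.4+ (for
multi-argument unnest and WITH ORDINALITY), assumes the output table is called
lda_output as elsewhere in this thread, and relies on topic_assignment listing
tokens in the same order as words expanded by counts, so that
sum(counts) = wordcount = array_length(topic_assignment, 1).

WITH tokens AS (        -- one row per token, in array order
    SELECT docid,
           u.wordid,
           row_number() OVER (PARTITION BY docid ORDER BY u.ord, g.i) AS pos
    FROM lda_output,
         LATERAL unnest(words, counts) WITH ORDINALITY AS u(wordid, cnt, ord),
         LATERAL generate_series(1, u.cnt) AS g(i)
),
assignments AS (        -- one row per topic assignment, in array order
    SELECT docid, a.topicid, a.pos
    FROM lda_output,
         LATERAL unnest(topic_assignment) WITH ORDINALITY AS a(topicid, pos)
)
SELECT t.docid, t.wordid, a.topicid + 1 AS topicid  -- shift 0-based topic ids to 1-based
FROM tokens t
JOIN assignments a USING (docid, pos)
WHERE t.docid = 6
ORDER BY pos;

Expanding words by counts like this also shows why 28934 was hard to match by
hand: its counts entry is 2, so it occupies two consecutive slots of
topic_assignment and ends up with both topic 2 and topic 15 in this row.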


2. Why does *topic_assignment* use zero-based indexing while other
results use one-based indexing?



Regards,
Markus Paaso

Re: LDA output format

Posted by Markus Paaso <ma...@gmail.com>.
Thanks Frank!

I will keep track of this issue.


Regards, Markus


Re: LDA output format

Posted by Frank McQuillan <fm...@pivotal.io>.
Markus,

Thanks for the extra explanation.

I agree that having topic_assignment as a dense vector but words as a
sparse vector is not good practice; I am not sure why it was done that way
in the first place.  They should be consistent.

The reason sparse vectors are used, I believe, is to keep the LDA model state
as small as possible, since it is possible to hit the 1 GB PostgreSQL field
limit in certain real-world use cases.

So I think what you are doing seems reasonable.  I don’t have a simpler way
to do it at this point, given the current way LDA is built.

I have updated this JIRA
https://issues.apache.org/jira/browse/MADLIB-1160
proposing:

1) use consistent indexes (start at 1)
2) use consistent array types (sparse)

which hopefully will help some.

Frank



Re: LDA output format

Posted by Markus Paaso <ma...@gmail.com>.
Hi Frank,

Thanks for the response.
The solution you suggest doesn't fully answer my question.

I don't want to join the input table with the top-k wordids from each topic, as
that is not what lda_predict is doing. Am I right?
I want to explain the results of madlib.lda_predict exactly, i.e. why
lda_predict gives a particular topic distribution for a given set of wordids.

I understand your approach but unfortunately it doesn't fulfill my needs.
I have found a solution that works for me and was just asking if you know a
better way to handle the results of lda_predict (or the data in the
*output_data_table* of lda_train).

So my question was more about how to construct the wordid array from the
'counts' and 'words' columns of the madlib.lda_predict results, or e.g.
'my_outdata' in your example.
Now I am doing it with (counts::text || ':' || words::text)::madlib.svec::float[]
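
As a side note for anyone reading along: the svec text form is
'{run lengths}:{values}', so the counts above act as run lengths that expand
words back into one entry per token. A tiny illustration (assuming MADlib is
installed in a schema named madlib):

SELECT ('{1,2}:{10,20}'::madlib.svec)::float8[];
-- expected result: {10,20,20}, i.e. wordid 10 once and wordid 20 twice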

The thing I don't understand about the *output_data_table* format is why
*topic_assignment* is a dense vector but *words* is a sparse vector.
I would suggest changing the *words* format to a dense vector, which would
remove the need to play with the madlib.svec object in order to explain the
lda_predict results exactly.
The *counts* column would then no longer be needed in *output_data_table*.


Regards,
Markus Paaso








-- 
Markus Paaso
Tel: +358504067849

Re: LDA output format

Posted by Frank McQuillan <fm...@pivotal.io>.
Markus,

Sorry for the delay in responding.  Below is another approach that involves a
JOIN between the term frequency table and an LDA output table.  It is a bit
simpler, at least for me.

Regarding the indexing, you are correct that it is inconsistent.  It seems the
term frequency module uses 0-based indexing (e.g., docid) whereas LDA uses
1-based indexing (e.g., topicid).  I created a JIRA to address this:
https://issues.apache.org/jira/browse/MADLIB-1160
Probably the best way is to standardize on 1-based indexing.

----

1) Create data set

DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable
models, originally developed for analyzing the semantic content of large
document corpora.'),
(1, 'By the late 1960s, the balance between pitching and hitting had swung
in favor of the pitchers. In 1968 Carl Yastrzemski won the American League
batting title with an average of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with
computational statistics; a discipline that also specializes in
prediction-making. It has strong ties to mathematical optimization, which
deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the
east to the Pacific Coast in the west, from the Redwood–Douglas fir forests
of the northwest, to the Mojave Desert areas in the southeast. The center
of the state is dominated by the Central Valley, a major agricultural area.
');


2) Convert docs to a list of words

ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(contents),
E'[\\s+\\.\\,]');


3) Term frequency

DROP TABLE IF EXISTS my_training, my_training_vocabulary;
SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training',
TRUE);
SELECT * FROM my_training order by docid limit 20;
 docid | wordid | count
-------+--------+-------
     0 |     57 |     1
     0 |     86 |     1
     0 |      4 |     1
     0 |     55 |     1
     0 |     69 |     2
     0 |     81 |     1
     0 |     30 |     1
     0 |     33 |     1
     0 |     36 |     1
     0 |     43 |     1
     0 |     25 |     1
     0 |     65 |     2
     0 |     72 |     1
     0 |      9 |     1
     0 |      0 |     2
     0 |     29 |     1
     0 |     18 |     1
     0 |     12 |     1
     0 |     96 |     1
     0 |     91 |     1
(20 rows)


4) Run LDA

DROP TABLE IF EXISTS my_model, my_outdata;
SELECT madlib.lda_train( 'my_training',   -- Data table
                         'my_model',      -- Model table
                         'my_outdata',    -- Output table
                         104,             -- Vocab size
                         5,               -- Num topics
                         10,              -- Num iterations
                         5,               -- Alpha (Dirichlet param)
                         0.01             -- Beta (Dirichlet param)
                       );


5) Get topic description by top-k words

DROP TABLE IF EXISTS my_topic_desc;
SELECT madlib.lda_get_topic_desc( 'my_model',
                                  'my_training_vocabulary',
                                  'my_topic_desc',
                                  15);

6) Join topic description with term frequency
SELECT docid, wordid, topicid FROM my_topic_desc JOIN my_training USING
(wordid) WHERE docid=3 ORDER BY docid, wordid;

docid | wordid | topicid
-------+--------+---------
     3 |      0 |       3
     3 |      4 |       2
     3 |      5 |       3
     3 |     13 |       2
     3 |     14 |       5
     3 |     20 |       5
     3 |     21 |       2
     3 |     23 |       2
     3 |     24 |       3
     3 |     27 |       2
     3 |     32 |       3
     3 |     35 |       3
     3 |     38 |       5
     3 |     39 |       5
etc
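
Note that step 6 pairs words with topics via the top-k topic descriptions, not
via the per-token sampler output. If the per-token assignments are wanted
instead, the svec trick mentioned earlier in this thread should also work
against the my_outdata table written in step 4; a sketch (0-based topic ids
shifted to 1-based):

SELECT docid,
       unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
       unnest(topic_assignment) + 1 AS topicid
FROM my_outdata
WHERE docid = 3;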

Frank




Re: LDA output format

Posted by Markus Paaso <ma...@gmail.com>.
Hi Frank,

I want to explain the LDA results for a single document (in this case for
docid = 6) by binding a topicid to each wordid in the document.
The SQL query below gives exactly what I want, but I am not sure whether it is
the most effective way to build docid-wordid-topicid triples.

SELECT docid,
       unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
       unnest(topic_assignment) + 1 AS topicid
FROM lda_output
WHERE docid = 6;

I have trained LDA with 'lda_output' as the output_data_table argument in
madlib.lda_train.
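
A quick sanity check of the alignment that query relies on (just a sketch): for
every document the expanded words array should be exactly as long as
topic_assignment, i.e. sum(counts) = wordcount = array_length(topic_assignment, 1).

SELECT docid,
       wordcount,
       array_length(topic_assignment, 1) AS n_assignments,
       (SELECT sum(c) FROM unnest(counts) AS c) AS n_tokens
FROM lda_output
WHERE docid = 6;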


Regards, Markus



-- 
Markus Paaso
Tel: +358504067849

Re: LDA output format

Posted by Frank McQuillan <fm...@pivotal.io>.
Markus,

Please see example 4 in the user docs
http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
which provides helper functions for learning more about the learned model.

-- The topic description by top-k words
DROP TABLE IF EXISTS my_topic_desc;
SELECT madlib.lda_get_topic_desc( 'my_model',
                                  'my_training_vocabulary',
                                  'my_topic_desc',
                                  15);
select * from my_topic_desc order by topicid, prob DESC;

produces:

topicid | wordid |        prob        |       word
---------+--------+--------------------+-------------------
       1 |     69 |  0.181900726392252 | of
       1 |     52 | 0.0608353510895884 | is
       1 |     65 | 0.0608353510895884 | models
       1 |     30 | 0.0305690072639225 | corpora
       1 |      1 | 0.0305690072639225 | 1960s
       1 |     57 | 0.0305690072639225 | latent
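
As a follow-on, the same table can be filtered per topic, for example to list
the most probable words for a single topic (a small sketch using only the
columns shown above):

SELECT word, prob
FROM my_topic_desc
WHERE topicid = 1
ORDER BY prob DESC
LIMIT 5;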

Please let us know if this is of use, or if you are looking for something else.

Frank



Re: LDA output format

Posted by Markus Paaso <ma...@gmail.com>.
Hi,

I found a working but quite awkward way to form the docid-wordid-topicid
pairing with a single SQL query:

SELECT docid,
       unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
       unnest(topic_assignment) + 1 AS topicid
FROM lda_output
WHERE docid = 6;

Output:

 docid | wordid | topicid
-------+--------+---------
     6 |   7386 |       3
     6 |  42021 |      17
     6 |  42021 |      17
     6 |   7705 |      12
     6 | 105334 |      16
     6 |  18083 |       3
     6 |  89364 |       3
     6 |  31073 |       3
     6 |  28934 |       3
     6 |  28934 |      16
     6 |  56286 |      16
     6 |  61921 |       3
     6 |  61921 |       3
     6 |  59142 |      17
     6 |  33364 |       3
     6 |  79035 |      17
     6 |  37792 |      11
     6 |  91823 |      11
     6 |  30422 |       3
     6 |  94672 |      17
     6 |  62107 |       3
     6 |  94673 |       2
     6 |  62080 |      16
     6 | 101046 |      17
     6 |   4379 |       8
     6 |   4379 |       8
     6 |   4379 |       8
     6 |   4379 |       8
     6 |   4379 |       8
     6 |  26503 |      12
     6 |  61105 |       3
     6 |  19193 |       3
     6 |  28929 |       3


Is there any simpler way to do that?
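
One small extension of the same idea (again just a sketch): aggregating the
pairs back into a per-document topic histogram, which should line up with the
topic_count column of the output table.

SELECT docid, topicid0 + 1 AS topicid, count(*) AS n_tokens
FROM (
    SELECT docid, unnest(topic_assignment) AS topicid0   -- raw 0-based topic id
    FROM lda_output
    WHERE docid = 6
) t
GROUP BY docid, topicid0
ORDER BY topicid;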


Regards,
Markus Paaso





-- 
Markus Paaso
Tel: +358504067849