Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/01/11 19:33:00 UTC

[jira] [Comment Edited] (MADLIB-1160) Usability changes for LDA

    [ https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322817#comment-16322817 ] 

Frank McQuillan edited comment on MADLIB-1160 at 1/11/18 7:32 PM:
------------------------------------------------------------------

Yes that is related to my comment 3) above.  We *should* introduce all relevant LDA functions at the top of the user docs, and not just use them in the examples without explaining what they are.

Which helper functions for LDA and term frequency are you referring to, [~jingyimei]?


was (Author: fmcquillan):
Yes that is related to my comment 3) above.  We *should* introduce all relevant LDA functions at the top of the user docs, and not just use them in the examples.

Which helper function for lda and tf are you referring to [~jingyimei] ?

> Usability changes for LDA
> -------------------------
>
>                 Key: MADLIB-1160
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1160
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v1.14
>
>
> Context
> Please see this thread from the user mailing list
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E
> Tasks
> 1)  Term frequency
> http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
> and LDA
> http://madlib.apache.org/docs/latest/group__grp__lda.html
> should both create indexes that start at 1, to make them consistent with other MADlib modules.  One or both of these currently creates indexes starting at 0.
> 2)  In the output_data_table  *topic_assignment* is a dense vector but *words* is a sparse vector (svec).
> We should change *topic_assignment* to be a sparse vector to be consistent.
> Note:  the reason sparse vectors were used in the first place (I think) is to keep the model state as small as possible, so the sparse format is preferred to the dense format in this case, although svecs are a bit harder to work with.  We have hit the Postgres 1GB field size limit in some use cases.
> 3) The user docs could also use some cleanup at the same time.  E.g., helper functions are used in the examples but not described above.
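
To make tasks 1) and 2) concrete, here is a rough Python sketch of the two transformations being discussed: shifting 0-based ids to start at 1, and run-length encoding a dense vector so that repeated values are stored once. This is not MADlib code. The function names are hypothetical, and the "{counts}:{values}" text layout is an assumption about how an svec is written out, used here only for illustration.

```python
from itertools import groupby


def to_one_based(ids):
    """Shift 0-based ids so that the first id is 1 (hypothetical helper)."""
    return [i + 1 for i in ids]


def run_length_encode(dense):
    """Collapse consecutive equal values into (count, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(dense)]


def to_svec_text(dense):
    """Render a dense vector as '{counts}:{values}'.

    The layout is an assumption about the svec text form, not taken
    from the issue itself.
    """
    pairs = run_length_encode(dense)
    counts = ",".join(str(count) for count, _ in pairs)
    values = ",".join(str(value) for _, value in pairs)
    return "{" + counts + "}:{" + values + "}"


if __name__ == "__main__":
    # Word ids as they might come out of a 0-based term frequency step.
    print(to_one_based([0, 2, 5]))
    # A made-up dense topic_assignment with runs of repeated topics.
    print(to_svec_text([3, 3, 3, 1, 1, 4]))
```

The space saving depends on how long the runs of repeated values are, which is why the sparse form helps keep the model state small but is harder to index into directly.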


