Posted to issues@madlib.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/02/10 01:42:18 UTC

[jira] [Commented] (MADLIB-933) MADlib LDA term_frequency function bugs

    [ https://issues.apache.org/jira/browse/MADLIB-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140107#comment-15140107 ] 

ASF GitHub Bot commented on MADLIB-933:
---------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/15

    Term Freq: Allow custom col names, avoid temp vocab

    JIRA: MADLIB-933
    
    - Fixed a minor bug that forced users to use "doc_id" as a column name.
    - Fixed an incorrect temp table output for the vocabulary.
    
    @mktal: Please review the PR and push to the apache remote after approval. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib bugfix/term_freq_fixes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/15.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15
    
----
commit 8c770f0535692cb3685b876e8f4e06d8c5844548
Author: Rahul Iyer <ri...@pivotal.io>
Date:   2015-12-07T22:01:04Z

    Term Freq: Allow custom col names, avoid temp vocab
    
    JIRA: MADLIB-933
    
    - Fixed a minor bug that forced users to use "doc_id" as a column name.
    - Fixed an incorrect temp table output for the vocabulary.

----


> MADlib LDA term_frequency function bugs
> ---------------------------------------
>
>                 Key: MADLIB-933
>                 URL: https://issues.apache.org/jira/browse/MADLIB-933
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Parallel Latent Dirichlet Allocation
>            Reporter: Srivatsan
>            Assignee: Rahul Iyer
>             Fix For: v1.9
>
>
> 1. The madlib.term_frequency() function (http://doc.madlib.net/latest/group__grp__text__utilities.html) takes the docid column and the words column as inputs, which suggests that the columns can be named anything. In practice it complains if the columns are not actually named "docid" and "words".
> 2. It also takes an output table as an input (e.g. documents_tf), but it creates a temp table for the vocabulary, so I can't specify a schema name like vatsan.documents_tf. This is annoying for two reasons:
> a. The user can't immediately tell what the vocabulary table is for, or why it is a temp table while the documents_tf table itself is not.
> b. With a real-world dataset for LDA, my models are going to run for quite some time. I may even terminate one session and run the LDA model in another, in which case the vocabulary temp table won't be available in the other session (or will have been dropped).
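
To make point 1 of the report concrete, here is a minimal Python sketch of a term-frequency step in which the column names are supplied by the caller instead of being hard-coded. It is not MADlib's implementation and does not reflect the actual patch: the helper name term_frequency, the list-of-dicts row format, and the sample column names "id" and "tokens" are all assumptions made for illustration.

```python
from collections import Counter


def term_frequency(rows, docid_col, words_col):
    """Return sorted (doc id, word, count) triples.

    Hypothetical helper for illustration only, not MADlib code.
    The column names are parameters, so nothing depends on the
    input columns being called "docid" and "words".
    """
    counts = Counter()
    for row in rows:
        for word in row[words_col]:
            counts[(row[docid_col], word)] += 1
    return sorted((doc, word, n) for (doc, word), n in counts.items())


# Sample input with deliberately non-default column names.
documents = [
    {"id": 1, "tokens": ["data", "model", "data"]},
    {"id": 2, "tokens": ["topic", "model"]},
]

tf = term_frequency(documents, docid_col="id", words_col="tokens")
for doc, word, n in tf:
    print(doc, word, n)
```

Because the column names are only ever read through the two parameters, renaming the input columns requires changing the call site and nothing else, which is the behaviour the reporter expected from the function signature.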


