Posted to user@madlib.apache.org by Rahul Iyer <ri...@pivotal.io> on 2015/12/04 21:12:15 UTC

Re: MADlib LDA :(

Hi Vatsan

Thanks for the feedback!

Points 1 and 2 are bugs, not design choices. The fixes are minor and have
been completed. Before adding them to the repo, I would prefer that you
create a JIRA issue
<https://issues.apache.org/jira/browse/MADLIB/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel>
so that we have a record of the problem.

Point 3: If I understand correctly, you're looking to use the LDA function
without providing a vocabulary table. The LDA interface was not changed
when we added the term_frequency function - lda_train() does not require a
vocab table and can be called directly using your own term frequency table.

Note: lda_train() still has a limitation: the names of the input table
columns are hard-coded. I'd recommend filing another JIRA to remove that
limitation.
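
To illustrate point 3, here is a minimal Python sketch (not MADlib code) of
preparing your own term frequency data outside the database, after your own
stemming and filtering. The function name build_term_frequency, the 0-based
ids, and the in-memory layout are assumptions made for illustration; only
the column names docid, wordid and count come from this thread.

```python
from collections import Counter


def build_term_frequency(docs):
    """Turn tokenized documents into (docid, wordid, count) rows.

    `docs` is a list of token lists. Ids are assigned starting at 0,
    which is an assumption of this sketch, not a documented requirement.
    Returns the rows and the word -> wordid vocabulary mapping.
    """
    vocab = {}
    rows = []
    for docid, words in enumerate(docs):
        counts = Counter(words)
        # Sort so that wordid assignment is deterministic within a document.
        for word in sorted(counts):
            wordid = vocab.setdefault(word, len(vocab))
            rows.append((docid, wordid, counts[word]))
    return rows, vocab


rows, vocab = build_term_frequency([["a", "b", "a"], ["b", "c"]])
```

The resulting rows could then be loaded into a table whose columns are named
docid, wordid and count, since those names are currently hard-coded.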

Best,
Rahul


On Fri, Dec 4, 2015 at 9:46 AM, Srivatsan Ramanujam <sr...@pivotal.io>
wrote:

> Not sure if we reviewed this implementation's interface before, but it has
> a couple of annoyances:
>
>    1. The madlib.term_frequency() function (
>    http://doc.madlib.net/latest/group__grp__text__utilities.html) takes
>    the docid column and words column as inputs, which fools us into
>    thinking that we can name our columns whatever we want, but it
>    complains if the columns are not actually named "docid" and "words"!
>    2. Secondly, it takes an output table as well as an input (e.g.
>    documents_tf), but it creates a temp table for the vocabulary
>    (therefore I can't specify a schema name like vatsan.documents_tf). This is
>    annoying for two reasons:
>       1. The user can't immediately sense what the vocabulary table is
>       for, or why it is a temp table while the documents_tf table itself is not.
>       2. If I have a real-world dataset for LDA, my models are going to
>       run for quite some time. I may even terminate one session and run the LDA
>       model in another session, which would mean the vocabulary temp table won't
>       be available in the other session (or would have gotten dropped).
>    3. Can I really create my own input table for LDA (one that has docid,
>    wordid, count)? If so, should I also create a vocabulary table (does MADlib
>    look for this in the same schema as the input table)? It would be good to
>    provide this functionality as well, because at times we'd want to do our
>    own stemming/lemmatization and frequency filtering of tokens before
>    passing them as input to the LDA. While the current implementation is an
>    improvement over the previous one (where the user had to do everything from
>    scratch), it has introduced some inconveniences as well.
>
> Please clarify.
>
> Thanks
> Vatsan
>
>
> --
>
> ____________________________________
>
> Srivatsan Ramanujam | Data Science
> Pivotal HQ - Palo Alto, CA
> Mobile: 650-483-5630
> ____________________________________
>

Re: MADlib LDA :(

Posted by Frank McQuillan <fm...@pivotal.io>.
So to close on this thread:

MADlib LDA term_frequency function bugs
https://issues.apache.org/jira/browse/MADLIB-933
is valid: it covers removing the hard-coded column names and making the
vocabulary table a regular table rather than a temp table.

https://issues.apache.org/jira/browse/MADLIB-934
is marked as Won't Fix, since INT4 is used by design because of memory
management issues.

On Fri, Dec 4, 2015 at 2:58 PM, Srivatsan Ramanujam <sr...@pivotal.io>
wrote:

> Hi Rahul,
> I've updated the second ticket as discussed:
> https://issues.apache.org/jira/browse/MADLIB-934
>
> Thanks
>
>
> On Fri, Dec 4, 2015 at 2:25 PM, Srivatsan Ramanujam <sramanujam@pivotal.io>
> wrote:
>
>> Great, submitted https://issues.apache.org/jira/browse/MADLIB-934 and
>> https://issues.apache.org/jira/browse/MADLIB-933

Re: MADlib LDA :(

Posted by Rahul Iyer <ri...@pivotal.io>.
We have a new Apache JIRA instance:
https://issues.apache.org/jira/browse/MADLIB/
You'll need a login for this (I suggest using the same ID as your Apache ID,
if you have one).

It looks like we're doing an assert(num of cols == 3), so the error is caused
by the additional column. IMO that's a horrible check and should be removed.
Please add an issue for this as well and I'll get rid of it.
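
To make the difference concrete, here is a small Python sketch (not the
actual MADlib source) that contrasts a strict column-count check with a
check that only requires the three expected columns to be present. The
function names are hypothetical; the column names are the ones discussed in
this thread.

```python
REQUIRED_COLUMNS = ("docid", "wordid", "count")


def check_exact_count(columns):
    """Strict check: passes only when the table has exactly 3 columns."""
    return len(columns) == 3


def missing_columns(columns):
    """Lenient check: list the required columns that are absent.

    Extra columns are ignored; an empty result means the table is usable.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# A table with an extra "word" column, as in the reported case.
table_columns = ["docid", "wordid", "count", "word"]
strict_ok = check_exact_count(table_columns)
missing = missing_columns(table_columns)
```

With the extra column, the strict check rejects the table while the lenient
check finds nothing missing.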

On Fri, Dec 4, 2015 at 1:55 PM, Srivatsan Ramanujam <sr...@pivotal.io>
wrote:

> Thanks for the response Rahul.
>
> By JIRA, are you referring to the internal JIRA, or do we have something
> else now that it's in Apache incubation?
>
> For #3, I have to check, but I had essentially created my own input table
> with 4 columns: "docid", "wordid", "count", and a fourth column "word"
> (corresponding to the raw token). Of these, the type of the "count" column
> was bigint and not int. I am not sure what prompted the lda_train function
> to throw an error; it said the input table did not contain docid, wordid
> and count columns. I did not check whether that was because of the data
> type mismatch of the count column or because of the additional column I
> had. Can you confirm which one it is?