Posted to issues@hivemall.apache.org by takuti <gi...@git.apache.org> on 2017/06/23 21:29:09 UTC

[GitHub] incubator-hivemall pull request #89: [HIVEMALL-120] Refactor on LDA/pLSA's m...

GitHub user takuti opened a pull request:

    https://github.com/apache/incubator-hivemall/pull/89

    [HIVEMALL-120] Refactor on LDA/pLSA's mini-batch & buffered iteration logic

    ## What changes were proposed in this pull request?
    
    Refactor LDA/pLSA implementation for better mini-batch & buffered iteration logic
    
    ## What type of PR is it?
    
    Refactoring
    
    ## What is the Jira issue?
    
    https://issues.apache.org/jira/browse/HIVEMALL-120
    
    ## How was this patch tested?
    
    Unit test & manual test on EMR
    
    ## How to use this feature?
    
    Nothing has been changed

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/takuti/incubator-hivemall topicmodel-refactor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/89.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #89
    
----
commit 740bf40fb4be3a3a5c2d35b88fdf622f64cc2bd6
Author: Takuya Kitazawa <k....@gmail.com>
Date:   2017-06-22T01:54:28Z

    Create auxiliary methods for mini-batch training

commit 1c51600bdbb79481d08891ad4cc6072fc950e09a
Author: Takuya Kitazawa <k....@gmail.com>
Date:   2017-06-22T21:19:34Z

    Create base classes for LDA/pLSA

commit 236be2c1edca2dae765381e659293709b10fa861
Author: Takuya Kitazawa <k....@gmail.com>
Date:   2017-06-22T21:45:41Z

    closeWithoutModelReset() -> finalizeTraining()

commit bd6e720ca2ecad3a20b7f89936cd5e9b20ef900c
Author: Takuya Kitazawa <k....@gmail.com>
Date:   2017-06-22T21:50:07Z

    Separate "recordBytes" and "requiredBytes"

commit fc47acae98575748749384d0e8142d5895cc5abf
Author: Takuya Kitazawa <k....@gmail.com>
Date:   2017-06-22T22:06:12Z

    Change the place where `perplexityPrev` is updated

commit bc4b9d3a13310d85c0fb4fc572d0338fefee0a8d
Author: Takuya Kitazawa <k....@gmail.com>
Date:   2017-06-22T22:21:48Z

    Update test cases

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hivemall issue #89: [HIVEMALL-120] Refactor on LDA/pLSA's mini-bat...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/89
  
    @takuti Adding an alias `-tol` for `-eps` is a good idea.



[GitHub] incubator-hivemall issue #89: [HIVEMALL-120] Refactor on LDA/pLSA's mini-bat...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/89
  
    > Adding an alias -tol for -eps is a good idea.
    
    Sure.
    
    > it accepts the input format of Hivemall's feature vector
    
    Oh, by the way, it's not a matter of format but of type: `String` versus `FeatureValue`.
    
    The current implementation re-parses each feature-value-formatted `String` at [HERE](https://github.com/takuti/incubator-hivemall/blob/bc4b9d3a13310d85c0fb4fc572d0338fefee0a8d/core/src/main/java/hivemall/topicmodel/AbstractProbabilisticTopicModel.java#L66-L69) every time a mini-batch is passed. If we directly pass a two-dimensional array of `FeatureValue` as word counts, we can avoid the unnecessary parse operations.
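
The idea of parsing once and reusing the result could look roughly like the sketch below. It is only an illustration under assumptions: `ParsedTerm`, `MiniBatchParseSketch`, and the "split on the last colon" rule are hypothetical stand-ins, not Hivemall's `FeatureValue` class or its actual parsing code.

```java
// Hypothetical stand-in for a parsed "word:count" pair; NOT Hivemall's FeatureValue.
final class ParsedTerm {
    final String word;
    final float count;

    ParsedTerm(String word, float count) {
        this.word = word;
        this.count = count;
    }
}

final class MiniBatchParseSketch {
    // Counts how many strings were parsed, to make the saving visible.
    static int parseCalls = 0;

    // Parses one "word:count" string, splitting on the last ':' (an assumption).
    static ParsedTerm parse(String s) {
        parseCalls++;
        int pos = s.lastIndexOf(':');
        if (pos <= 0) {
            throw new IllegalArgumentException("expected word:count but got " + s);
        }
        return new ParsedTerm(s.substring(0, pos), Float.parseFloat(s.substring(pos + 1)));
    }

    // Parses every document of the mini-batch exactly once.
    static ParsedTerm[][] parseAll(String[][] miniBatch) {
        ParsedTerm[][] parsed = new ParsedTerm[miniBatch.length][];
        for (int d = 0; d < miniBatch.length; d++) {
            parsed[d] = new ParsedTerm[miniBatch[d].length];
            for (int i = 0; i < miniBatch[d].length; i++) {
                parsed[d][i] = parse(miniBatch[d][i]);
            }
        }
        return parsed;
    }

    // Stand-in for one training pass: sums all counts of the already-parsed batch.
    static float totalCount(ParsedTerm[][] parsed) {
        float sum = 0f;
        for (ParsedTerm[] doc : parsed) {
            for (ParsedTerm t : doc) {
                sum += t.count;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        String[][] miniBatch = {{"foo:10", "bar:2"}, {"baz:1"}};
        ParsedTerm[][] parsed = parseAll(miniBatch); // three strings parsed, once
        for (int iter = 0; iter < 5; iter++) {
            totalCount(parsed); // repeated passes reuse the parsed terms
        }
        System.out.println("parse calls: " + parseCalls); // prints "parse calls: 3"
    }
}
```

With string input, the same five passes would parse fifteen strings instead of three; the trade-off is that the typed array has to be built and held by the caller.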





[GitHub] incubator-hivemall issue #89: [HIVEMALL-120] Refactor on LDA/pLSA's mini-bat...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/89
  
    @myui What do you think about adding a `-tol` option as an alias of the `-eps` option, as I mentioned in #87?
    
    Also, if you have further ideas for improving the topic modeling UDFs, please share them and let's discuss them here!
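
For illustration, an alias like this can be handled by mapping the alias to its canonical name before the value is stored. The sketch below uses only the Java standard library; `OptionAliasSketch`, its alias table, and the `-name value` pair format are assumptions made for the example and do not reflect how Hivemall actually parses UDF options.

```java
import java.util.HashMap;
import java.util.Map;

final class OptionAliasSketch {
    // Hypothetical alias table: alias name -> canonical option name.
    static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("tol", "eps");
    }

    // Parses "-name value" pairs and stores each value under the canonical name.
    static Map<String, String> parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("options must come as -name value pairs");
        }
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("not an option name: " + args[i]);
            }
            String name = args[i].substring(1);
            String canonical = ALIASES.getOrDefault(name, name);
            opts.put(canonical, args[i + 1]);
        }
        return opts;
    }

    public static void main(String[] args) {
        // Both spellings end up under the same key, "eps".
        System.out.println(parse(new String[] {"-tol", "0.01"}).get("eps")); // prints "0.01"
        System.out.println(parse(new String[] {"-eps", "0.1"}).get("eps"));  // prints "0.1"
    }
}
```

Resolving the alias at parse time means the rest of the code only ever reads one key, so no training logic has to know that two spellings exist.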



[GitHub] incubator-hivemall issue #89: [HIVEMALL-120] Refactor on LDA/pLSA's mini-bat...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/89
  
    At first, I planned to replace `String[][] miniBatch` with something like `FeatureValue[][] miniBatch`, similar to the generic predictor implementation; that is, instead of handling word counts as String (e.g., "foo:10"), I tried to represent them as `FeatureValue` for readability. However, since `PLSAPredictionUDAF` and `LDAPredictionUDAF` need to create an OI (ObjectInspector) for such word counts, a standard Java String is much handier than `FeatureValue` objects.
    
    Thus, a document is still represented as an array (or list) of String for now.
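
As a rough picture of that representation, the sketch below builds one document as a list of "word:count" strings from raw tokens. `WordCountStrings` and `toWordCounts` are hypothetical names used only for this example; nothing here is taken from the Hivemall code base.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WordCountStrings {
    // Counts tokens and renders each as "word:count", keeping first-seen order.
    static List<String> toWordCounts(String[] tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        List<String> doc = new ArrayList<>(counts.size());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            doc.add(e.getKey() + ":" + e.getValue());
        }
        return doc;
    }

    public static void main(String[] args) {
        String[] tokens = {"foo", "bar", "foo"};
        System.out.println(toWordCounts(tokens)); // prints "[foo:2, bar:1]"
    }
}
```

Because each element is an ordinary String, a whole document fits a plain list-of-strings type without any custom object having to be described to the query engine.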



[GitHub] incubator-hivemall issue #89: [HIVEMALL-120] Refactor on LDA/pLSA's mini-bat...

Posted by takuti <gi...@git.apache.org>.
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/89
  
    @myui Oh, you move very fast :) 
    
    > -tol alias
    
    Sure~



[GitHub] incubator-hivemall issue #89: [HIVEMALL-120] Refactor on LDA/pLSA's mini-bat...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/89
  
    Merged this PR. Create `[HIVEMALL-120-2]` for `-tol alias for -eps`.



[GitHub] incubator-hivemall issue #89: [HIVEMALL-120] Refactor on LDA/pLSA's mini-bat...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/89
  
    No need to replace `String[][] miniBatch`. It is already using `FeatureVector` to parse feature vectors, and thus it accepts the input format of Hivemall's feature vector.



[GitHub] incubator-hivemall issue #89: [HIVEMALL-120] Refactor on LDA/pLSA's mini-bat...

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/89
  
    
    Coverage increased (+0.2%) to 40.236% when pulling **bc4b9d3a13310d85c0fb4fc572d0338fefee0a8d on takuti:topicmodel-refactor** into **c06378a81723e3998f90c08ec7444ead5b6f2263 on apache:master**.




[GitHub] incubator-hivemall issue #89: [HIVEMALL-120] Refactor on LDA/pLSA's mini-bat...

Posted by myui <gi...@git.apache.org>.
Github user myui commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/89
  
    ah, parsing multiple times could be avoided.



[GitHub] incubator-hivemall pull request #89: [HIVEMALL-120] Refactor on LDA/pLSA's m...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-hivemall/pull/89

