
Posted to dev@mahout.apache.org by "Mat Kelcey (JIRA)" <ji...@apache.org> on 2011/05/13 07:51:47 UTC

[jira] [Created] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Option to determine number of words for LDADriver from a specified dictionary
-----------------------------------------------------------------------------

                 Key: MAHOUT-695
                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
            Reporter: Mat Kelcey
            Priority: Minor


It bugged me that you needed to specify the number of words directly to the LDADriver 
eg ./bin/mahout lda \
     -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
     -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 

with this patch you can instead provide a dictionary; we just count the terms in the dictionary
eg ./bin/mahout lda \
     -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
     -o ./examples/bin/work/reuters-lda \
     -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
     -k 20 -ow -x 20 


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033469#comment-13033469 ] 

Jake Mannix commented on MAHOUT-695:
------------------------------------

Hmm... the dataset is always there, and for algorithms running on vectors it will always consist of at least one vector, naturally.  So I'm guessing we should always be able to do this, and in the grand scheme of things it will not be much slower than just supplying the param. So yeah, I agree, let's swap it out so that we just sniff the data and use the info we get from that!

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Commented] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033457#comment-13033457 ] 

Jake Mannix commented on MAHOUT-695:
------------------------------------

But awesome work, thanks, this is great; I've also often been annoyed by this missing feature.  What would be even cooler?  If there were no dictionary, we could just sniff the first vector in the data set and ask for its getSize()!

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Attachment:     (was: mahout-695.patch)

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695-sniff-vector.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Attachment: mahout-695-sniff-vector.patch

a new version to include changes from
https://issues.apache.org/jira/browse/MAHOUT-694

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695-sniff-vector.patch, mahout-695-sniff-vector.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-695:
-------------------------------

    Attachment: mahout-695.patch

Mat, in the future, make sure to generate your diffs in a git repo simply like this: "git diff --no-prefix". That formats them so they can be applied with a simple "patch -p0 < patchfile".

Attached is your patch, regenerated to be like this.
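
A minimal sketch of the workflow suggested above, assuming git and patch are installed. The scratch repository, the core/lda.conf file and its contents are made up for illustration and are not part of Mahout; only "git diff --no-prefix" and "patch -p0" come from the comment.

```shell
# Sketch only: a throwaway repo stands in for a real checkout.
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
mkdir -p core
printf 'numWords=50000\n' > core/lda.conf   # hypothetical file
git add core/lda.conf
git -c user.name=example -c user.email=example@example.org commit -q -m 'baseline'

# Change the file and write the diff without the a/ and b/ path prefixes.
printf 'numWords=auto\n' > core/lda.conf
git diff --no-prefix > ../mahout-695.patch

# Restore the tracked version, then apply the patch from the repo root.
# -p0 works because the paths in the patch carry no leading prefix to strip.
git checkout -q -- core/lda.conf
patch -p0 < ../mahout-695.patch
cat core/lda.conf
```

A diff produced by plain "git diff" carries a/ and b/ prefixes on its paths, so it would need "patch -p1" instead.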

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Commented] (MAHOUT-695) Have LDADriver determine numWords from input vectors

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117243#comment-13117243 ] 

Sean Owen commented on MAHOUT-695:
----------------------------------

Hey Mat, looks like a good patch, and Jake likes it. I think it's OK to break callers in this regard. Is it OK to commit this as far as you are concerned, or are there any other changes / concerns that should be addressed?
                
> Have LDADriver determine numWords from input vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch the ldadriver just checks a vector from the input to determine the size
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -ow -x 20 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-695) Have LDADriver determine numWords from input vectors

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118100#comment-13118100 ] 

Hudson commented on MAHOUT-695:
-------------------------------

Integrated in Mahout-Quality #1072 (See [https://builds.apache.org/job/Mahout-Quality/1072/])
    MAHOUT-695 Compute number of terms rather than specify it

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1177616
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java
* /mahout/trunk/examples/bin/build-reuters.sh

                
> Have LDADriver determine numWords from input vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch the ldadriver just checks a vector from the input to determine the size
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -ow -x 20 


[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Description: 
It bugged me that you needed to specify the number of words directly to the LDADriver 
eg ./bin/mahout lda \
     -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
     -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 

with this patch the ldadriver just checks a vector from the input to determine the size
eg ./bin/mahout lda \
     -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
     -o ./examples/bin/work/reuters-lda -k 20 -ow -x 20 


  was:
It bugged me that you needed to specify the number of words directly to the LDADriver 
eg ./bin/mahout lda \
     -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
     -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 

with this patch you can instead provide a dictionary; we just count the terms in the dictionary
eg ./bin/mahout lda \
     -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
     -o ./examples/bin/work/reuters-lda \
     -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
     -k 20 -ow -x 20 



> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch the ldadriver just checks a vector from the input to determine the size
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -ow -x 20 


[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Status: Open  (was: Patch Available)

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Attachment:     (was: mahout-695-sniff-vector.patch)

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-695:
-----------------------------

    Fix Version/s: 0.6
         Assignee: Jake Mannix

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Commented] (MAHOUT-695) Have LDADriver determine numWords from input vectors

Posted by "Mat Kelcey (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118083#comment-13118083 ] 

Mat Kelcey commented on MAHOUT-695:
-----------------------------------

I'll recheck that the patch applies cleanly over the weekend; it's been a while and I wouldn't be surprised if bin/build-reuters.sh has changed a bit.
                
> Have LDADriver determine numWords from input vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch the ldadriver just checks a vector from the input to determine the size
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -ow -x 20 


[jira] [Resolved] (MAHOUT-695) Have LDADriver determine numWords from input vectors

Posted by "Sean Owen (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-695.
------------------------------

    Resolution: Fixed
    
> Have LDADriver determine numWords from input vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch the ldadriver just checks a vector from the input to determine the size
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -ow -x 20 


[jira] [Commented] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033468#comment-13033468 ] 

Mat Kelcey commented on MAHOUT-695:
-----------------------------------

yeah right, i didn't think of checking the first vector...
is there any case when sniffing the dataset wouldn't work?
if not, it could just do that in all cases, and the --numWords or --dict parameters wouldn't be needed at all.

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Attachment: mahout-695.patch

is this patch formatted correctly? 
am using a cloned git repo from github
and generated it with 'git format-patch -p'

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Commented] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032833#comment-13032833 ] 

Mat Kelcey commented on MAHOUT-695:
-----------------------------------

is this patch formatted correctly? 
am using a cloned git repo from github
and generated it with 'git format-patch -p'

also any, and all, feedback welcome. 
i'd love to get my hands dirtier.

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 


[jira] [Commented] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040748#comment-13040748 ] 

Mat Kelcey commented on MAHOUT-695:
-----------------------------------

oops, patched from the wrong branch. i've deleted them and will resubmit a patch tomorrow.

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 

[jira] [Commented] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033471#comment-13033471 ] 

Mat Kelcey commented on MAHOUT-695:
-----------------------------------

cool, i'll give it a crack tomorrow and see how i go.

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 

[jira] [Commented] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033651#comment-13033651 ] 

Mat Kelcey commented on MAHOUT-695:
-----------------------------------

here's another patch for determining the number of words from the first vector.

i've left the numwords option in, though, as a form of deprecation so a warning can be given. the alternative, taking the option out, would fail at startup complaining about the unknown arg. so depending on how much backwards compatibility you're after, this might not be needed...
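
A minimal sketch of the "sniff the first vector" idea, assuming the number of words equals the size of any one input vector. This is not the code in the attached patch: the class and method names are hypothetical, and a plain double[] stands in for a term-frequency vector read from the input path.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch only: a plain-Java stand-in, not the code from the attached patch.
// Each double[] plays the role of one input term-frequency vector, and its
// length plays the role of the vector's cardinality (the number of words).
class NumWordsSniffer {

  // Returns the length of the first vector; later vectors are never read.
  static int sniffNumWords(Iterator<double[]> vectors) {
    if (!vectors.hasNext()) {
      throw new IllegalStateException("input contains no vectors; cannot determine numWords");
    }
    return vectors.next().length;
  }

  public static void main(String[] args) {
    // Two vectors of the same size, as produced from a single dictionary.
    List<double[]> input = Arrays.asList(new double[50000], new double[50000]);
    System.out.println("numWords = " + sniffNumWords(input.iterator()));
  }
}
```

Only the first vector is touched, so the cost is one record read regardless of how large the dataset is. An empty input fails loudly rather than silently falling back to a default.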

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695-sniff-vector.patch, mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 

[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Comment: was deleted

(was: is this patch formatted correctly? 
am using a cloned git repo from github
and generated it with 'git format-patch -p')

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 

[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Attachment: mahout-695-sniff-vector.patch

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695-sniff-vector.patch, mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 

[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Affects Version/s: 0.5
               Status: Patch Available  (was: Open)

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch you can instead provide a dictionary; we just count the terms in the dictionary
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 

[jira] [Updated] (MAHOUT-695) Option to determine number of words for LDADriver from a specified dictionary

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Attachment: mahout-695.patch

Have removed the NUM_WORDS option completely, which will break existing callers since it becomes an unknown parameter. (Not sure if backwards compatibility is an issue at this stage.) Am happy to re-include code to ignore it with a warning message that it's deprecated.
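
For illustration, the "ignore it with a warning" alternative could look roughly like the sketch below. It is a hand-rolled argument filter, not Mahout's option parser and not the patch itself; the class and method names are hypothetical, and the only flag taken from this issue is the `-v` shown in the example command line.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hand-rolled argument filtering, not Mahout's option parser.
class DeprecatedOptionFilter {

  // Returns the arguments with the deprecated flag (and its value, if any)
  // removed. One warning is appended per occurrence so the caller can log it.
  static List<String> dropDeprecated(String[] args, String flag, List<String> warnings) {
    List<String> kept = new ArrayList<>();
    for (int i = 0; i < args.length; i++) {
      if (args[i].equals(flag)) {
        warnings.add(flag + " is deprecated and ignored; the value is read from the input vectors");
        // Skip the flag's value too, if one follows.
        if (i + 1 < args.length && !args[i + 1].startsWith("-")) {
          i++;
        }
      } else {
        kept.add(args[i]);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    String[] oldStyle = {"-k", "20", "-v", "50000", "-ow", "-x", "20"};
    List<String> warnings = new ArrayList<>();
    List<String> kept = dropDeprecated(oldStyle, "-v", warnings);
    System.out.println(kept);
    warnings.forEach(w -> System.err.println("WARN: " + w));
  }
}
```

With a filter like this, an old-style command line still starts up and the caller sees a warning, whereas removing the option outright makes the same command fail on the unknown argument.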

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch the ldadriver just checks a vector from the input to determine the size
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -ow -x 20 

[jira] [Updated] (MAHOUT-695) Have LDADriver determine numWords from input vectors

Posted by "Mat Kelcey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-695:
------------------------------

    Summary: Have LDADriver determine numWords from input vectors  (was: Option to determine number of words for LDADriver from a specified dictionary)

> Have LDADriver determine numWords from input vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the LDADriver 
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> with this patch the ldadriver just checks a vector from the input to determine the size
> eg ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -ow -x 20 
