Posted to dev@mahout.apache.org by "John Conwell (Created) (JIRA)" <ji...@apache.org> on 2012/01/27 20:06:10 UTC

[jira] [Created] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile
-----------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-962
                 URL: https://issues.apache.org/jira/browse/MAHOUT-962
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.6
            Reporter: John Conwell
             Fix For: 0.6


The reasoning here is the same as behind the fix for MAHOUT-957. The desired output is term frequency (tf) vectors, but I want terms filtered by their minimum and maximum document frequency (DF) values. This is a valid use case for LDA, where tf vectors are the desired input but filtering out terms above maxDFPercent is also useful.

Currently minDF and maxDFPercent are only used when calculating tfidf, and the original tf vectors are not updated to reflect the term filtering.
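
For illustration only, the requested behaviour can be sketched in a few lines of Python. This is not Mahout code: the function names, the dict-based vector representation, and the inclusive comparison at both bounds are assumptions made for the example. It computes document frequencies from the tf vectors and then drops out-of-range terms from those same tf vectors, leaving the raw counts of the surviving terms untouched.

```python
from collections import Counter


def document_frequencies(tf_vectors):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for vec in tf_vectors:
        df.update(term for term, count in vec.items() if count > 0)
    return df


def prune_tf_vectors(tf_vectors, min_df=1, max_df_percent=100):
    """Drop terms whose document frequency is below min_df or above
    max_df_percent percent of the corpus (both bounds inclusive here,
    which is an assumption of this sketch)."""
    n_docs = len(tf_vectors)
    df = document_frequencies(tf_vectors)
    max_df = n_docs * max_df_percent / 100.0
    keep = {term for term, d in df.items() if min_df <= d <= max_df}
    # Term counts are preserved; only the vocabulary shrinks.
    return [{t: c for t, c in vec.items() if t in keep} for vec in tf_vectors]


# Hypothetical four-document corpus of term -> count maps.
docs = [
    {"the": 3, "mahout": 1, "lda": 2},
    {"the": 1, "hadoop": 4},
    {"the": 2, "lda": 1, "pig": 1},
    {"the": 5, "lda": 1},
]
pruned = prune_tf_vectors(docs, min_df=2, max_df_percent=80)
print(pruned)
```

In this toy corpus "the" occurs in every document and is removed by the 80 percent ceiling, while the terms that occur in a single document are removed by the minDF floor.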

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

Posted by "John Conwell (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195156#comment-13195156 ] 

John Conwell commented on MAHOUT-962:
-------------------------------------

Well, if the user wants to use sigma (stddev*sigma) to filter terms with high doc frequencies, yes, that should work.  But what if they want to filter explicitly by DF percent to get rid of high doc frequency terms?  Or, more importantly, what if they want to use minDF to filter low doc frequency terms?  The sigma flag won't take care of those.

I think I'm sounding picky, but as I'm going through using LDA (and CVB LDA) I'm playing with different tweaks of the input args in order to get "better" quality topic models.
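
The contrast between a sigma-derived ceiling and explicit bounds can be sketched as follows. This is a rough illustration, not Mahout's implementation: the helper names are hypothetical, and taking the cutoff as sigma times the population standard deviation of the document frequencies is an assumption based only on the "stddev*sigma" wording above.

```python
import statistics


def sigma_max_df(doc_freqs, sigma):
    """Upper DF cutoff derived from the spread of document frequencies
    (assumed form: sigma * population standard deviation)."""
    return sigma * statistics.pstdev(list(doc_freqs.values()))


def filter_terms(doc_freqs, min_df=None, max_df=None):
    """Keep terms whose DF lies within the bounds that were supplied."""
    return {
        term
        for term, d in doc_freqs.items()
        if (min_df is None or d >= min_df) and (max_df is None or d <= max_df)
    }


# Hypothetical document frequencies for a 100-document corpus.
doc_freqs = {"the": 100, "lda": 40, "topic": 30, "typo": 1}

cutoff = sigma_max_df(doc_freqs, sigma=2.0)
sigma_only = filter_terms(doc_freqs, max_df=cutoff)
with_min_df = filter_terms(doc_freqs, min_df=2, max_df=cutoff)
print(sorted(sigma_only), sorted(with_min_df))
```

The sigma cutoff only supplies a ceiling, so the rare term survives it; an explicit minDF is still needed to remove low-frequency terms.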
                


[jira] [Commented] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195117#comment-13195117 ] 

Grant Ingersoll commented on MAHOUT-962:
----------------------------------------

John, I think my fix on MAHOUT-957 should work, right?
                


[jira] [Commented] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

Posted by "Andy Schlaikjer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276869#comment-13276869 ] 

Andy Schlaikjer commented on MAHOUT-962:
----------------------------------------

Hi John, Grant,

I ran into this issue last summer while working with Jake Mannix on CVB0 LDA. I ended up writing a Pig script to produce weighted term vectors, along with Elephant Bird's SequenceFileStorage and VectorWritableConverter utilities:

https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/mahout/VectorWritableConverter.java

Now that the above are open sourced, I'd like to get a generic Mahout-Pig submodule rolling, and perhaps include a version of my term vector script there. The script ended up being relatively concise, with more flexible term filtering and weighting mechanisms. Due to Pig's execution plan optimization, it also ran faster than comparable Mahout utils on my data.

Best,
Andy
@sagemintblue

                
> minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-962
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-962
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: John Conwell
>            Priority: Minor
>             Fix For: 0.8


[jira] [Updated] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-962:
--------------------------------

    Fix Version/s:     (was: 0.7)
                   0.8
    
> minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-962
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-962
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: John Conwell
>            Priority: Minor
>             Fix For: 0.8


[jira] [Updated] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

Posted by "Dave Byrne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Byrne updated MAHOUT-962:
------------------------------

    Attachment: mahout_962.patch

Patch to apply minDF and maxDFPercent.
                
> minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-962
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-962
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6, 0.7, 0.8
>            Reporter: John Conwell
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.8
>
>         Attachments: mahout_962.patch


[jira] [Commented] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195564#comment-13195564 ] 

Grant Ingersoll commented on MAHOUT-962:
----------------------------------------

Valid points.  I think, however, I'm going to move this to 0.7.  
                
> minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-962
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-962
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: John Conwell
>             Fix For: 0.7


[jira] [Updated] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

Posted by "Dave Byrne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Byrne updated MAHOUT-962:
------------------------------

               Labels: patch  (was: )
    Affects Version/s: 0.8
                       0.7
               Status: Patch Available  (was: Open)
    
> minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-962
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-962
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.7, 0.6, 0.8
>            Reporter: John Conwell
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.8
>
>         Attachments: mahout_962.patch


[jira] [Updated] (MAHOUT-962) minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile

Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-962:
-----------------------------------

         Priority: Minor  (was: Major)
    Fix Version/s:     (was: 0.6)
                   0.7
    
> minDF and maxDFPercent filtering doesnt get applied when output weight is tf in SpareVecorsFromSequenceFile
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-962
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-962
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: John Conwell
>            Priority: Minor
>             Fix For: 0.7
