
Posted to dev@mahout.apache.org by "Vasil Vasilev (JIRA)" <ji...@apache.org> on 2011/05/05 15:03:03 UTC

[jira] [Created] (MAHOUT-688) High Document Frequency pruning for seq2sparse

High Document Frequency pruning for seq2sparse
----------------------------------------------

                 Key: MAHOUT-688
                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
             Project: Mahout
          Issue Type: Improvement
            Reporter: Vasil Vasilev
            Priority: Minor


This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.
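
A minimal sketch of the pruning rule described above, not the code from the attached patch. It assumes the standard deviation has already been computed and that a word is pruned when its document frequency exceeds multiplier * stdDev; the class name, method name, and sample numbers are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HighDfPruningSketch {

  /**
   * Keeps only the words whose document frequency is at most
   * multiplier * stdDev. Both values are supplied by the caller.
   */
  public static Map<String, Long> prune(Map<String, Long> docFreqs,
                                        double stdDev,
                                        double multiplier) {
    double maxDf = multiplier * stdDev;
    Map<String, Long> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
      if (e.getValue() <= maxDf) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Long> df = new LinkedHashMap<>();
    df.put("the", 1000L);
    df.put("mahout", 40L);
    df.put("vector", 25L);
    // 100.0 is an arbitrary illustrative standard deviation, not a measured one.
    System.out.println(prune(df, 100.0, 3.0).keySet());
  }
}
```

With these made-up numbers the cutoff is 300, so "the" is dropped and the program prints [mahout, vector].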

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-688:
-----------------------------------

    Attachment: MAHOUT-688.patch

Brings up to trunk.  Still needs a test.
                
> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030477#comment-13030477 ] 

Vasil Vasilev commented on MAHOUT-688:
--------------------------------------

Hi Grant,

Thanks for contributing to the code. One remark from my side: the standard deviation was intentionally calculated that way, because I wanted to "force" a zero mean. That is, I want to calculate the standard deviation so that words with a document frequency (DF) near zero have the highest probability of being kept. I imagine that for every word's DF there is a matching -DF (the DF with the opposite sign) and calculate the standard deviation over both. This ensures that only high-DF words are pruned.

Regards, Vasil
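
A small sketch of the "forced zero mean" idea from the comment above, under stated assumptions. Mirroring every DF with a -DF gives 2n values whose mean is 0 and whose sum of squares is twice the original, so the population variance reduces to sum(DF^2) / n. Dividing by n rather than n - 1 is an assumption here, and the class and method names are hypothetical.

```java
public class ZeroMeanStdDevSketch {

  /**
   * Standard deviation of the document frequencies about a mean of zero,
   * i.e. sqrt(sum(df^2) / n). This equals the population standard deviation
   * of the set {df_1, -df_1, ..., df_n, -df_n}.
   */
  public static double zeroMeanStdDev(long[] docFreqs) {
    if (docFreqs.length == 0) {
      throw new IllegalArgumentException("need at least one document frequency");
    }
    double sumSq = 0.0;
    for (long df : docFreqs) {
      sumSq += (double) df * df;
    }
    return Math.sqrt(sumSq / docFreqs.length);
  }

  public static void main(String[] args) {
    // Two identical DFs: the ordinary standard deviation would be 0,
    // while the zero-mean version stays positive.
    System.out.println(zeroMeanStdDev(new long[] {2, 2}));
  }
}
```

Because the result grows with the magnitude of the DFs themselves rather than with their spread around the average, a cutoff at a multiple of it can only be exceeded by words with large DFs.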

> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Reopened] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (Reopened) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reopened MAHOUT-688:
------------------------------------


Didn't realize this patch dropped maxDFPercent.  Adding it back in, but having the StdDev approach override it if both are present.
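
The precedence described above could be sketched as follows. This is only an illustration: the parameter names, the use of a negative sigma to mean "not set", and the way each option is turned into a maximum document frequency are assumptions, not the logic committed to SparseVectorsFromSequenceFiles.

```java
public class MaxDfSelectionSketch {

  /**
   * Returns the largest document frequency a word may have and still be kept.
   * If a sigma multiplier is set (non-negative), it overrides maxDfPercent.
   */
  public static long maxAllowedDf(double maxDfSigma, double stdDev,
                                  int maxDfPercent, long numDocs) {
    if (maxDfSigma >= 0.0) {
      // Standard-deviation approach wins when both options are present.
      return (long) Math.floor(maxDfSigma * stdDev);
    }
    // Otherwise fall back to the percentage of documents.
    return numDocs * maxDfPercent / 100L;
  }

  public static void main(String[] args) {
    System.out.println(maxAllowedDf(3.0, 10.0, 99, 1000));  // sigma set
    System.out.println(maxAllowedDf(-1.0, 10.0, 90, 1000)); // sigma not set
  }
}
```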
                
> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vasil Vasilev updated MAHOUT-688:
---------------------------------

    Attachment: MAHOUT-688.patch

Latest version of the patch, which includes the standard deviation calculation against a predefined mean.

> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030481#comment-13030481 ] 

Grant Ingersoll commented on MAHOUT-688:
----------------------------------------

OK, that makes reasonable sense.  Perhaps then, what we can do is add another stddev/variance calc where the mean is provided.  That way we can support both the more generic capabilities I added and you can still meet your goal.
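
One way to picture a calculation "where the mean is provided" is a method that takes the mean as an argument, so the caller can pass either the sample mean (the generic case) or a fixed value such as 0. This is a hedged sketch with hypothetical names, using the population form (division by n); it is not the API that was added to Mahout.

```java
public class StdDevWithMeanSketch {

  /** Arithmetic mean of the values. */
  public static double mean(double[] values) {
    double sum = 0.0;
    for (double v : values) {
      sum += v;
    }
    return sum / values.length;
  }

  /** Population standard deviation of the values about the supplied mean. */
  public static double stdDev(double[] values, double mean) {
    double sumSqDev = 0.0;
    for (double v : values) {
      double d = v - mean;
      sumSqDev += d * d;
    }
    return Math.sqrt(sumSqDev / values.length);
  }

  public static void main(String[] args) {
    double[] dfs = {2, 4, 4, 4, 5, 5, 7, 9};
    // Generic case: deviation about the sample mean.
    System.out.println(stdDev(dfs, mean(dfs)));
    // Predefined mean of zero, as requested for the pruning use case.
    System.out.println(stdDev(dfs, 0.0));
  }
}
```

For this sample the first call yields 2.0 and the second yields the square root of 29 (about 5.39), which shows how much the choice of mean changes the resulting cutoff.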

> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165335#comment-13165335 ] 

Hudson commented on MAHOUT-688:
-------------------------------

Integrated in Mahout-Quality #1238 (See [https://builds.apache.org/job/Mahout-Quality/1238/])
    MAHOUT-688: fix high df test

gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211874
Files : 
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/HighDFWordsPrunerTest.java

                
> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163285#comment-13163285 ] 

Grant Ingersoll commented on MAHOUT-688:
----------------------------------------

Working on an update to trunk for this.
                
> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163865#comment-13163865 ] 

Hudson commented on MAHOUT-688:
-------------------------------

Integrated in Mahout-Quality #1230 (See [https://builds.apache.org/job/Mahout-Quality/1230/])
    MAHOUT-688: Hook in high df pruning based on variance

gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1210994
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stats
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stats/BasicStats.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stats/StandardDeviationCalculatorMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stats/StandardDeviationCalculatorReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/HighDFWordsPruner.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/VectorizerConfig.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/pruner
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/pruner/PrunedPartialVectorMergeReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/pruner/WordsPrunerReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/tfidf/TFIDFConverter.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stats
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stats/BasicStatsTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/DictionaryVectorizerTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/HighDFWordsPrunerTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFilesTest.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/Vector.java

                
> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078657#comment-13078657 ] 

Vasil Vasilev commented on MAHOUT-688:
--------------------------------------

Hi Grant,

I will try to do so by the end of the week. If I am not done by then, I will be able to finish it after the 21st of August.

> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vasil Vasilev updated MAHOUT-688:
---------------------------------

    Attachment: MAHOUT-688.patch

High DF words pruning implementation

> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Priority: Minor
>              Labels: Vectorization
>         Attachments: MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Assigned] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned MAHOUT-688:
--------------------------------------

    Assignee: Grant Ingersoll

> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>         Attachments: MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Resolved] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-688.
------------------------------------

    Resolution: Fixed

Thanks, Vasil!
                
> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Resolved] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-688.
------------------------------------

    Resolution: Fixed

Added back in maxDFPercent
                
> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-688:
-----------------------------------

    Attachment: MAHOUT-688.patch

Reorganizes the code a little to move the std. dev. calculation to a common place.  Also adds tests for std. dev. and fixes the std. dev. calculation, which _I am pretty sure_ was calculated incorrectly (it was missing the subtraction of the average in the sum-of-squares calculation). Also adds license headers and cleans up the formatting a bit.

Since we are in code freeze, we can iterate on this a bit for 0.6.
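
To make the "missing subtraction of the average" concrete, here is a hedged sketch of a standard deviation computed from running sums, a form that suits combining partial results from several workers. It is not the code in BasicStats or the StandardDeviationCalculator classes; the class and its methods are hypothetical. With count n, sum S, and sum of squares Q, the population variance is Q/n - (S/n)^2. Dropping the second term leaves sqrt(Q/n), the root mean square, which is the deviation about a mean of zero.

```java
public class RunningStatsSketch {
  private long count;
  private double sum;
  private double sumSq;

  /** Adds one observation to the running totals. */
  public void add(double value) {
    count++;
    sum += value;
    sumSq += value * value;
  }

  /** Folds another partial result into this one. */
  public void merge(RunningStatsSketch other) {
    count += other.count;
    sum += other.sum;
    sumSq += other.sumSq;
  }

  /** sqrt(Q / n): what results when the mean is not subtracted. */
  public double rootMeanSquare() {
    return Math.sqrt(sumSq / count);
  }

  /** Population standard deviation: sqrt(Q / n - mean^2). */
  public double stdDev() {
    double mean = sum / count;
    // Clamp at zero to guard against tiny negative values from rounding.
    return Math.sqrt(Math.max(0.0, sumSq / count - mean * mean));
  }

  public static void main(String[] args) {
    RunningStatsSketch stats = new RunningStatsSketch();
    for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
      stats.add(v);
    }
    System.out.println("stdDev = " + stats.stdDev());
    System.out.println("rms    = " + stats.rootMeanSquare());
  }
}
```

For large counts or values of very different magnitude, the Q/n - mean^2 form can lose precision; a numerically safer variant would be a separate design choice.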


> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079803#comment-13079803 ] 

Vasil Vasilev commented on MAHOUT-688:
--------------------------------------

Hi Grant,

The patch should be ready now.

> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078452#comment-13078452 ] 

Grant Ingersoll commented on MAHOUT-688:
----------------------------------------

Vasil,

Any time to update this?  If you can put up a new patch, I can look to review it soon and get it committed.

> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-688:
-----------------------------------

    Fix Version/s: 0.6

> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.


[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for seq2sparse

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165188#comment-13165188 ] 

Hudson commented on MAHOUT-688:
-------------------------------

Integrated in Mahout-Quality #1237 (See [https://builds.apache.org/job/Mahout-Quality/1237/])
    MAHOUT-688: fix dropping of maxDFPercent

gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211828
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java

                
> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
>                 Key: MAHOUT-688
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Vasil Vasilev
>            Assignee: Grant Ingersoll
>            Priority: Minor
>              Labels: Vectorization
>             Fix For: 0.6
>
>         Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.
