Posted to dev@mahout.apache.org by "Vasil Vasilev (JIRA)" <ji...@apache.org> on 2011/05/05 15:03:03 UTC
[jira] [Created] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
High Document Frequency pruning for seq2sparse
----------------------------------------------
Key: MAHOUT-688
URL: https://issues.apache.org/jira/browse/MAHOUT-688
Project: Mahout
Issue Type: Improvement
Reporter: Vasil Vasilev
Priority: Minor
This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.
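To make the rule concrete, here is a minimal in-memory sketch. It is not the code from the patch (which runs as Hadoop jobs); the class and method names are hypothetical, the standard deviation is passed in rather than computed, and the cutoff of `sigmas * stdDev` measured from zero is an assumption about how the multiple is applied.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Mahout.
final class HighDfPruningSketch {

    // Keeps the words whose document frequency is at most sigmas * stdDev.
    // Assumption: the cutoff is measured from zero, not from the mean DF.
    static Map<String, Long> prune(Map<String, Long> docFreqs, double stdDev, double sigmas) {
        double cutoff = sigmas * stdDev;
        Map<String, Long> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
            if (e.getValue() <= cutoff) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Long> df = new LinkedHashMap<>();
        df.put("the", 1000L);   // appears in almost every document
        df.put("mahout", 12L);
        df.put("vector", 40L);
        // stdDev = 100 is an illustrative value, not derived from the map above.
        System.out.println(prune(df, 100.0, 3.0).keySet());
    }
}
```

With 3 sigmas and an illustrative standard deviation of 100, the cutoff is 300, so only the very frequent word is dropped.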
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll updated MAHOUT-688:
-----------------------------------
Attachment: MAHOUT-688.patch
Brings up to trunk. Still needs a test.
> High Document Frequency pruning for seq2sparse
> ----------------------------------------------
>
> Key: MAHOUT-688
> URL: https://issues.apache.org/jira/browse/MAHOUT-688
> Project: Mahout
> Issue Type: Improvement
> Reporter: Vasil Vasilev
> Assignee: Grant Ingersoll
> Priority: Minor
> Labels: Vectorization
> Fix For: 0.6
>
> Attachments: MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch, MAHOUT-688.patch
>
>
> This improvement makes it possible to prune words with high document frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the standard deviation of the words' document frequencies: the user specifies which words to prune as a multiple of this standard deviation. A good choice is 3 times the standard deviation.
[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030477#comment-13030477 ]
Vasil Vasilev commented on MAHOUT-688:
--------------------------------------
Hi Grant,
Thanks for contributing to the code. One remark from my side: the standard deviation was intentionally calculated that way, because I wanted to "force" a zero mean. That is, I want to calculate the standard deviation so that words with a document frequency (DF) near zero have the highest probability of being kept. I imagine that for every word's DF there is a matching -DF (the DF with the opposite sign) and calculate the standard deviation over both. This ensures that only high-DF words will be pruned.
Regards, Vasil
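The mirroring argument in this comment can be written out as a short derivation. This is a sketch under one assumption that the thread does not state: the variance divides by the number of values (population form). The symbols N, sigma_0, mu and sigma are introduced here for illustration.

```latex
% Assumption: population (divide-by-count) variance over N words
% with document frequencies DF_1, ..., DF_N.
% Mirrored sample: {DF_1, ..., DF_N, -DF_1, ..., -DF_N}; its mean is 0 by construction.
\[
\sigma_0^2
  = \frac{1}{2N}\sum_{i=1}^{N}\left[(DF_i - 0)^2 + (-DF_i - 0)^2\right]
  = \frac{1}{N}\sum_{i=1}^{N} DF_i^2
\]
% Relation to the ordinary mean \mu and variance \sigma^2 of the DFs:
\[
\sigma_0^2 = \sigma^2 + \mu^2, \qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} DF_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (DF_i - \mu)^2
\]
```

Under that assumption the zero-mean figure is never smaller than the ordinary standard deviation, and a cutoff of a few multiples of it is measured from zero, so it can only exclude words at the high end of the DF range.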
[jira] [Reopened] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (Reopened) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll reopened MAHOUT-688:
------------------------------------
Didn't realize this patch dropped maxDFPercent. Adding it back in, but having the StdDev approach override it if both are present.
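The precedence described here can be sketched as follows. This is an illustration, not the committed code: the class, method, and parameter names are hypothetical (only maxDFPercent appears in this thread), and the conversion of each setting into a document-count cutoff is an assumption.

```java
import java.util.OptionalDouble;

// Hypothetical illustration only; not part of Mahout.
final class MaxDfPrecedenceSketch {

    // Returns the largest document frequency a word may have and still be kept.
    // If a sigma multiple is supplied it overrides the percentage setting.
    static double maxDocFreq(int maxDfPercent, OptionalDouble maxDfSigma,
                             double stdDev, long numDocs) {
        if (maxDfSigma.isPresent()) {
            // Assumption: cutoff is the sigma multiple times the standard deviation.
            return maxDfSigma.getAsDouble() * stdDev;
        }
        // Assumption: the percentage is taken relative to the number of documents.
        return numDocs * maxDfPercent / 100.0;
    }

    public static void main(String[] args) {
        // Only the percentage is set.
        System.out.println(maxDocFreq(90, OptionalDouble.empty(), 50.0, 1000L));
        // Both are set; the standard deviation rule wins.
        System.out.println(maxDocFreq(90, OptionalDouble.of(3.0), 50.0, 1000L));
    }
}
```

Modelling the sigma setting as an optional value keeps "not specified" distinct from any numeric value, which is one simple way to express "override if both are present".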
[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vasil Vasilev updated MAHOUT-688:
---------------------------------
Attachment: MAHOUT-688.patch
Latest version of the patch, which includes the standard deviation calculation against a predefined mean.
[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030481#comment-13030481 ]
Grant Ingersoll commented on MAHOUT-688:
----------------------------------------
OK, that makes reasonable sense. Perhaps then, what we can do is add another stddev/variance calc where the mean is provided. That way we can support both the more generic capabilities I added and you can still meet your goal.
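A minimal in-memory sketch of the two calculations side by side, assuming a population (divide-by-count) standard deviation. The class and method names are hypothetical and this is not the Hadoop-based code from the patch.

```java
// Hypothetical illustration only; not part of Mahout.
final class StdDevSketch {

    // Standard deviation measured around a caller-supplied mean.
    // Passing 0.0 gives the "forced zero mean" variant discussed in this thread.
    static double stdDevAround(long[] values, double mean) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double sumOfSquares = 0.0;
        for (long v : values) {
            double diff = v - mean;
            sumOfSquares += diff * diff;
        }
        // Assumption: population form, dividing by the number of values.
        return Math.sqrt(sumOfSquares / values.length);
    }

    // Generic variant: computes the mean first, then delegates.
    static double stdDev(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double sum = 0.0;
        for (long v : values) {
            sum += v;
        }
        return stdDevAround(values, sum / values.length);
    }

    public static void main(String[] args) {
        long[] docFreqs = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(stdDev(docFreqs));            // around the computed mean
        System.out.println(stdDevAround(docFreqs, 0.0)); // around a provided mean of zero
    }
}
```

Having the generic method delegate to the provided-mean method keeps one sum-of-squares loop for both use cases.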
[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165335#comment-13165335 ]
Hudson commented on MAHOUT-688:
-------------------------------
Integrated in Mahout-Quality #1238 (See [https://builds.apache.org/job/Mahout-Quality/1238/])
MAHOUT-688: fix high df test
gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211874
Files :
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/HighDFWordsPrunerTest.java
[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163285#comment-13163285 ]
Grant Ingersoll commented on MAHOUT-688:
----------------------------------------
Working on an update to trunk for this.
[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163865#comment-13163865 ]
Hudson commented on MAHOUT-688:
-------------------------------
Integrated in Mahout-Quality #1230 (See [https://builds.apache.org/job/Mahout-Quality/1230/])
MAHOUT-688: Hook in high df pruning based on variance
gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1210994
Files :
* /mahout/trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stats
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stats/BasicStats.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stats/StandardDeviationCalculatorMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stats/StandardDeviationCalculatorReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/HighDFWordsPruner.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/VectorizerConfig.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/pruner
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/pruner/PrunedPartialVectorMergeReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/pruner/WordsPrunerReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/tfidf/TFIDFConverter.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stats
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stats/BasicStatsTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/DictionaryVectorizerTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/HighDFWordsPrunerTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFilesTest.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/Vector.java
[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078657#comment-13078657 ]
Vasil Vasilev commented on MAHOUT-688:
--------------------------------------
Hi Grant,
I will try to do so by the end of the week. If I am not ready by then, I will be able to finish it after the 21st of August.
[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vasil Vasilev updated MAHOUT-688:
---------------------------------
Attachment: MAHOUT-688.patch
High DF words pruning implementation
[jira] [Assigned] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll reassigned MAHOUT-688:
--------------------------------------
Assignee: Grant Ingersoll
[jira] [Resolved] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll resolved MAHOUT-688.
------------------------------------
Resolution: Fixed
Thanks, Vasil!
[jira] [Resolved] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll resolved MAHOUT-688.
------------------------------------
Resolution: Fixed
Added back in maxDFPercent
[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll updated MAHOUT-688:
-----------------------------------
Attachment: MAHOUT-688.patch
Reorganizes the code a little to move the std. dev. calculation to a common place. Also adds tests for std. dev. and fixes the std. dev. calculation, which _I am pretty sure_ was incorrect (it was missing the subtraction of the average in the sum-of-squares calc). Also adds license headers and cleans up the formatting a bit.
Since we are in code freeze, we can iterate on this a bit for 0.6.
[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Vasil Vasilev (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079803#comment-13079803 ]
Vasil Vasilev commented on MAHOUT-688:
--------------------------------------
Hi Grant,
The patch should be ready now.
[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078452#comment-13078452 ]
Grant Ingersoll commented on MAHOUT-688:
----------------------------------------
Vasil,
Any time to update this? If you can put up a new patch, I can look to review it soon and get it committed.
[jira] [Updated] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll updated MAHOUT-688:
-----------------------------------
Fix Version/s: 0.6
[jira] [Commented] (MAHOUT-688) High Document Frequency pruning for
seq2sparse
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165188#comment-13165188 ]
Hudson commented on MAHOUT-688:
-------------------------------
Integrated in Mahout-Quality #1237 (See [https://builds.apache.org/job/Mahout-Quality/1237/])
MAHOUT-688: fix dropping of maxDFPercent
gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211828
Files :
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java