Posted to dev@mahout.apache.org by "John Conwell (Created) (JIRA)" <ji...@apache.org> on 2012/01/25 01:14:40 UTC

[jira] [Created] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering
------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-957
                 URL: https://issues.apache.org/jira/browse/MAHOUT-957
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.6
            Reporter: John Conwell
             Fix For: 0.6


SparseVectorsFromSequenceFiles throws an exception when you request term frequency (tf) vectors as output together with the maxDFSigma filtering option.

The if / else if section shown below skips the call to DictionaryVectorizer.createTermFrequencyVectors for that combination.  The condition creates vectors when you want tf vectors without maxDFSigma filtering, or tfidf vectors with or without maxDFSigma filtering.  But if you want tf vectors with maxDFSigma filtering, neither branch matches, createTermFrequencyVectors is never called, and a later step throws an exception because its vector input path doesn't exist.

For example, the following cmd line will reproduce this situation:
bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2 --minDF 2 --maxDFSigma 3 -seq

// the suspect code, at around line 267 of SparseVectorsFromSequenceFiles, where DictionaryVectorizer.createTermFrequencyVectors is called
if (!processIdf && !shouldPrune) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
    minLLRValue, norm, logNormalize, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
} else if (processIdf) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
    minLLRValue, -1.0f, false, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
}
// no branch handles !processIdf && shouldPrune, so that case creates no tf vectors
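
The gap in the condition quoted above can be checked by enumerating all four combinations of the two flags. The following is only a minimal standalone sketch: the class BranchCoverage and the method createsTfVectors are hypothetical names made up for illustration, and the method merely mirrors the shape of the quoted if / else if without calling any Mahout code.

```java
class BranchCoverage {
    // Mirrors the quoted if / else-if: returns true when some branch
    // would reach the createTermFrequencyVectors call.
    static boolean createsTfVectors(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return true;   // tf weighting, no pruning
        } else if (processIdf) {
            return true;   // tfidf weighting, with or without pruning
        }
        return false;      // tf weighting with pruning matches neither branch
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean processIdf : flags) {
            for (boolean shouldPrune : flags) {
                System.out.printf("processIdf=%b shouldPrune=%b -> tf vectors created: %b%n",
                    processIdf, shouldPrune, createsTfVectors(processIdf, shouldPrune));
            }
        }
    }
}
```

Of the four combinations, only processIdf=false with shouldPrune=true reports false, which corresponds to the -wt tf with --maxDFSigma command line given above.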


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194379#comment-13194379 ] 

Grant Ingersoll commented on MAHOUT-957:
----------------------------------------

Whoa, backing up here a second.  If you are only asking for term frequencies, that kind of precludes doing document frequency pruning.  I don't think this is a bug, but we should probably prevent this from running to begin with, as it is not a valid combination of inputs.  
                

        

[jira] [Commented] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by "John Conwell (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196548#comment-13196548 ] 

John Conwell commented on MAHOUT-957:
-------------------------------------

Grant, yup, that works great! Thanks for the commit.
                

        

[jira] [Resolved] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by "Grant Ingersoll (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-957.
------------------------------------

    Resolution: Fixed
    

        

[jira] [Commented] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194377#comment-13194377 ] 

Grant Ingersoll commented on MAHOUT-957:
----------------------------------------

OK, I can reproduce the bug:
{quote}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/tmp/foo/tf-vectors-toprune
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
    at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:366)
    at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:277)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
{quote}
                

       

[jira] [Commented] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194395#comment-13194395 ] 

Grant Ingersoll commented on MAHOUT-957:
----------------------------------------

Then again, perhaps it is still all right.  Just because we don't care about the weighting in the vector doesn't mean we can't prune high document frequency terms out of it.
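
If weighting and pruning are treated as independent in this way, one possible shape for the branch is to let the weighting alone pick the normalization arguments, so that every combination produces tf vectors. The sketch below only illustrates that idea and is not taken from MAHOUT-957.patch: the class TfCallPlan and its plan method are hypothetical, and the only values borrowed from the issue are the two argument pairs seen in the quoted calls (norm, logNormalize for tf and -1.0f, false for tfidf). Whether normalization of tf vectors should instead be deferred until after pruning is a question this sketch does not settle.

```java
class TfCallPlan {
    final float norm;
    final boolean logNormalize;

    TfCallPlan(float norm, boolean logNormalize) {
        this.norm = norm;
        this.logNormalize = logNormalize;
    }

    // Hypothetical restructuring: the pruning flag is deliberately not an input,
    // so tf weighting with pruning can no longer fall through without a plan.
    static TfCallPlan plan(boolean processIdf, float norm, boolean logNormalize) {
        if (processIdf) {
            // tfidf: same arguments as the quoted else-if branch
            return new TfCallPlan(-1.0f, false);
        }
        // tf only: same arguments as the quoted first branch
        return new TfCallPlan(norm, logNormalize);
    }

    public static void main(String[] args) {
        TfCallPlan tf = plan(false, 2.0f, true);
        TfCallPlan tfidf = plan(true, 2.0f, true);
        System.out.printf("tf:    norm=%.1f logNormalize=%b%n", tf.norm, tf.logNormalize);
        System.out.printf("tfidf: norm=%.1f logNormalize=%b%n", tfidf.norm, tfidf.logNormalize);
    }
}
```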
                

        

[jira] [Commented] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195565#comment-13195565 ] 

Grant Ingersoll commented on MAHOUT-957:
----------------------------------------

I committed my patch.  John, does that fix things for you?
                

        

[jira] [Updated] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-957:
-----------------------------------

    Attachment: MAHOUT-957.patch

Here's a fix.
                

        

[jira] [Commented] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195610#comment-13195610 ] 

Hudson commented on MAHOUT-957:
-------------------------------

Integrated in Mahout-Quality #1325 (See [https://builds.apache.org/job/Mahout-Quality/1325/])
    MAHOUT-957: handle pruning of tf weights

gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1237072
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFilesTest.java

                

        

[jira] [Assigned] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by "Grant Ingersoll (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned MAHOUT-957:
--------------------------------------

    Assignee: Grant Ingersoll
    