
Posted to user@mahout.apache.org by John Conwell <jo...@iamjohn.me> on 2012/01/18 22:16:48 UTC

term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

I pulled the latest from trunk and built it, and while running
SparseVectorsFromSequenceFiles I noticed what I think is a bug.
SparseVectorsFromSequenceFiles throws an exception when you ask for term
frequency vectors as output together with the maxDFSigma filtering option.

Basically, the if / else if section shown below skips the call to
DictionaryVectorizer.createTermFrequencyVectors when you have that
combination.  The condition creates vectors when you want tf vectors
without maxDFSigma filtering, or tfidf vectors with maxDFSigma filtering,
but if you want tf vectors with maxDFSigma filtering, it skips the call to
createTermFrequencyVectors entirely and later throws an exception because
the vector input path doesn't exist.

Is this a known issue?  I'm assuming that's not the way it's supposed to
work, correct?  If it is intentional, I think some sort of validation
should stop the user before any processing starts.

// at line ~267 in trunk

if (!processIdf && !shouldPrune) {
  DictionaryVectorizer.createTermFrequencyVectors(
      tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
      minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
      sequentialAccessOutput, namedVectors);
} else if (processIdf) {
  DictionaryVectorizer.createTermFrequencyVectors(
      tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
      minLLRValue, -1.0f, false, reduceTasks, chunkSize,
      sequentialAccessOutput, namedVectors);
}
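
To make the gap concrete, here is a minimal, self-contained sketch that
mirrors only the shape of the if / else if above and enumerates all four
flag combinations. It does not call any Mahout code. The class name
TfBranchCheck and the methods createsTfVectors and validate are hypothetical
names made up for illustration, and validate is just one possible form of
the up-front check suggested above, not something that exists in trunk.

```java
final class TfBranchCheck {

  // Same branch structure as the snippet from SparseVectorsFromSequenceFiles.
  // Returns true when one of the two branches would call
  // createTermFrequencyVectors, false when both are skipped.
  static boolean createsTfVectors(boolean processIdf, boolean shouldPrune) {
    if (!processIdf && !shouldPrune) {
      return true;   // tf weighting, no pruning
    } else if (processIdf) {
      return true;   // tfidf weighting, with or without pruning
    }
    return false;    // tf weighting with pruning: neither branch runs
  }

  // Hypothetical fail-fast check (an assumption, not Mahout API): reject the
  // unsupported combination before any job is launched.
  static void validate(boolean processIdf, boolean shouldPrune) {
    if (!createsTfVectors(processIdf, shouldPrune)) {
      throw new IllegalArgumentException(
          "tf weighting combined with maxDFSigma pruning would skip "
              + "term frequency vector creation");
    }
  }

  public static void main(String[] args) {
    boolean[] values = {false, true};
    for (boolean processIdf : values) {
      for (boolean shouldPrune : values) {
        System.out.println("processIdf=" + processIdf
            + " shouldPrune=" + shouldPrune
            + " createsTfVectors=" + createsTfVectors(processIdf, shouldPrune));
      }
    }
  }
}
```

Only processIdf=false with shouldPrune=true falls through both branches,
which matches the tf plus maxDFSigma case described above.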

-- 

Thanks,
John C

Re: term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by John Conwell <jo...@iamjohn.me>.

Yup...will do

On Tue, Jan 24, 2012 at 3:46 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Can you open a JIRA issue, if you haven't already, and mark it for 0.6?


-- 

Thanks,
John C

Re: term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by Grant Ingersoll <gs...@apache.org>.

Can you open a JIRA issue, if you haven't already, and mark it for 0.6?


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by John Conwell <jo...@iamjohn.me>.

Any time you ask for term frequency rather than tfidf weighting (-wt tf)
and combine it with maxDFSigma rather than maxDFPercent (--maxDFSigma 3),
the term vectors are not created (as shown in the code from my original
message).

For example, the following cmd line will reproduce this situation:

bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o
/Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2
--minDF 2 --maxDFSigma 3 -seq
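
As a rough illustration of why this command line lands in the skipped case,
the sketch below derives the two booleans from the options. It rests on
assumptions that are not copied from trunk: that processIdf is false only
when -wt tf is given, and that shouldPrune is true whenever a non-negative
--maxDFSigma is supplied. The class Seq2SparseFlags and its hand-rolled
parser are hypothetical and stand in for the real option handling.

```java
final class Seq2SparseFlags {
  final boolean processIdf;
  final boolean shouldPrune;

  Seq2SparseFlags(String weight, double maxDFSigma) {
    // Assumption: any weighting other than "tf" means tfidf processing.
    this.processIdf = !"tf".equalsIgnoreCase(weight);
    // Assumption: pruning is enabled by a non-negative sigma value.
    this.shouldPrune = maxDFSigma >= 0.0;
  }

  // True for the combination that neither branch of the if / else if handles.
  boolean hitsSkippedBranch() {
    return !processIdf && shouldPrune;
  }

  // Minimal parser for the two options that matter here; every other
  // argument is ignored. Defaults (tfidf, no sigma) are assumptions.
  static Seq2SparseFlags parse(String[] args) {
    String weight = "tfidf";
    double maxDFSigma = -1.0;
    for (int i = 0; i + 1 < args.length; i++) {
      if ("-wt".equals(args[i])) {
        weight = args[i + 1];
      } else if ("--maxDFSigma".equals(args[i])) {
        maxDFSigma = Double.parseDouble(args[i + 1]);
      }
    }
    return new Seq2SparseFlags(weight, maxDFSigma);
  }

  public static void main(String[] args) {
    // The options from the command above, minus the input and output paths.
    String[] cmd = {"-wt", "tf", "--minSupport", "2", "--minDF", "2",
        "--maxDFSigma", "3", "-seq"};
    Seq2SparseFlags flags = parse(cmd);
    System.out.println("processIdf=" + flags.processIdf
        + " shouldPrune=" + flags.shouldPrune
        + " hitsSkippedBranch=" + flags.hitsSkippedBranch());
  }
}
```

Under those assumptions, dropping --maxDFSigma or switching to -wt tfidf
moves the run back into one of the two handled branches.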

Thanks,
John

On Sun, Jan 22, 2012 at 3:00 PM, Grant Ingersoll <gs...@apache.org> wrote:

> What were the command/options you were passing in?


-- 

Thanks,
John C

Re: term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Posted by Grant Ingersoll <gs...@apache.org>.

What were the command/options you were passing in?



--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
