Posted to user@mahout.apache.org by mw <mw...@plista.com> on 2015/04/21 12:37:08 UTC

SparseVectorsFromSequenceFiles tfidf fail

Hello,

I am trying to get tfidf vectors from a corpus of 100k documents. I 
noticed that the tfidf sequence file is empty, while the tf vectors are not.

Here is the log from SparseVectorsFromSequenceFiles:

INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in /opt/seq
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Pruning

Here is the tfidf output dir:

root@test:[/opt/sparse/tfidf-vectors] # ll
total 20K
drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:27 .
drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 21 12:27 ..
-rw-r--r-- 1 tomcat7 tomcat7   90 Apr 21 12:27 part-r-00000
-rw-r--r-- 1 tomcat7 tomcat7   12 Apr 21 12:27 .part-r-00000.crc
-rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:27 _SUCCESS
-rw-r--r-- 1 tomcat7 tomcat7    8 Apr 21 12:27 ._SUCCESS.crc

Here is the tf output dir:
root@test:[/opt/sparse/tf-vectors] # ll
total 3.7M
drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:27 .
drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 21 12:27 ..
-rw-r--r-- 1 tomcat7 tomcat7 3.6M Apr 21 12:27 part-r-00000
-rw-r--r-- 1 tomcat7 tomcat7  29K Apr 21 12:27 .part-r-00000.crc
-rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:27 _SUCCESS
-rw-r--r-- 1 tomcat7 tomcat7    8 Apr 21 12:27 ._SUCCESS.crc

Here is the input dir:
root@test:[/opt/seq] # ll
total 81M
drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:25 .
drwxrwxrwx 9 tomcat7 root    4.0K Apr 21 12:25 ..
-rw-r--r-- 1 tomcat7 tomcat7  31M Apr 21 12:25 part-m-00000
-rw-r--r-- 1 tomcat7 tomcat7 242K Apr 21 12:25 .part-m-00000.crc
-rw-r--r-- 1 tomcat7 tomcat7  31M Apr 21 12:25 part-m-00001
-rw-r--r-- 1 tomcat7 tomcat7 242K Apr 21 12:25 .part-m-00001.crc
-rw-r--r-- 1 tomcat7 tomcat7  20M Apr 21 12:25 part-m-00002
-rw-r--r-- 1 tomcat7 tomcat7 155K Apr 21 12:25 .part-m-00002.crc
-rw-r--r-- 1 tomcat7 tomcat7    0 Apr 21 12:25 _SUCCESS
-rw-r--r-- 1 tomcat7 tomcat7    8 Apr 21 12:25 ._SUCCESS.crc


I am running it through ToolRunner with the following parameters:
-i /opt/seq -o /opt/sparse/ -nv --maxDFSigma 2.0 --weight tfidf

Any hints why it might be failing?

Best,
Max

Re: SparseVectorsFromSequenceFiles tfidf fail

Posted by mw <mw...@plista.com>.

Increasing maxDFSigma solved it. Does anybody know why that is?

On 04/22/2015 11:12 AM, mw wrote:
> Also, I noticed that something must be going wrong when calculating the 
> variance, since the file in stdcalc seems to be empty:
>
> root@test:[/opt/sparse/stdcalc] # ll
> total 20K
> drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 22 11:02 .
> drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 22 11:02 ..
> -rw-r--r-- 1 tomcat7 tomcat7  155 Apr 22 11:02 part-r-00000
> -rw-r--r-- 1 tomcat7 tomcat7   12 Apr 22 11:02 .part-r-00000.crc
> -rw-r--r-- 1 tomcat7 tomcat7    0 Apr 22 11:02 _SUCCESS
> -rw-r--r-- 1 tomcat7 tomcat7    8 Apr 22 11:02 ._SUCCESS.crc
>
> On 04/21/2015 02:14 PM, mw wrote:
>> Mahout 0.10.0
>>
>> On 04/21/2015 02:05 PM, Suneel Marthi wrote:
>>> Which Mahout version are you running?
>>>
>>
>
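
A possible explanation for the maxDFSigma behaviour asked about above, offered as a sketch and not confirmed anywhere in this thread: assume the driver turns maxDFSigma into a whole-number percentage of the document count before pruning, roughly as maxDF = (int) (100 * maxDFSigma * stdDev / vectorCount), and then prunes every term whose document frequency exceeds vectorCount * maxDF / 100. Under that assumption, a small sigma on a large corpus can truncate the percentage to 0, so every term is pruned and the tfidf vectors come out empty. The class name, the method name, and the stdDev value of 400.0 below are hypothetical and chosen only to illustrate the arithmetic; only the 100k document count comes from the original post.

```java
public class MaxDfSigmaSketch {

    /**
     * Assumed (not confirmed in this thread) derivation of the pruning threshold:
     * the sigma-based limit is first truncated to a whole percentage of the corpus,
     * then converted back to an absolute document-frequency cutoff.
     */
    static long maxDfThreshold(double maxDfSigma, double stdDev, long vectorCount) {
        // The int cast truncates toward zero, so anything below 1 percent becomes 0.
        int maxDfPercent = (int) (100.0 * maxDfSigma * stdDev / vectorCount);
        return (long) (vectorCount * (maxDfPercent / 100.0f));
    }

    public static void main(String[] args) {
        long docs = 100_000L;   // corpus size mentioned in the post
        double stdDev = 400.0;  // hypothetical standard deviation of document frequencies

        // 100 * 2.0 * 400 / 100000 = 0.8, truncated to 0 percent: cutoff is 0.
        System.out.println("sigma 2.0 -> cutoff " + maxDfThreshold(2.0, stdDev, docs));

        // 100 * 3.0 * 400 / 100000 = 1.2, truncated to 1 percent: cutoff is positive.
        System.out.println("sigma 3.0 -> cutoff " + maxDfThreshold(3.0, stdDev, docs));
    }
}
```

If this reading is right, raising maxDFSigma simply pushes the percentage past the truncation point, which would match what was observed. Checking the actual standard deviation written to stdcalc against the document count would confirm or rule it out.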

Re: SparseVectorsFromSequenceFiles tfidf fail

Posted by mw <mw...@plista.com>.

Also, I noticed that something must be going wrong when calculating the 
variance, since the file in stdcalc seems to be empty:

root@test:[/opt/sparse/stdcalc] # ll
total 20K
drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 22 11:02 .
drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 22 11:02 ..
-rw-r--r-- 1 tomcat7 tomcat7  155 Apr 22 11:02 part-r-00000
-rw-r--r-- 1 tomcat7 tomcat7   12 Apr 22 11:02 .part-r-00000.crc
-rw-r--r-- 1 tomcat7 tomcat7    0 Apr 22 11:02 _SUCCESS
-rw-r--r-- 1 tomcat7 tomcat7    8 Apr 22 11:02 ._SUCCESS.crc

On 04/21/2015 02:14 PM, mw wrote:
> Mahout 0.10.0
>
> On 04/21/2015 02:05 PM, Suneel Marthi wrote:
>> Which Mahout version are you running?
>>
>

Re: SparseVectorsFromSequenceFiles tfidf fail

Posted by mw <mw...@plista.com>.

Mahout 0.10.0

On 04/21/2015 02:05 PM, Suneel Marthi wrote:
> Which Mahout version are you running?
>

Re: SparseVectorsFromSequenceFiles tfidf fail

Posted by Suneel Marthi <sm...@apache.org>.

Which Mahout version are you running?
