
Posted to user@mahout.apache.org by Alok Tanna <ta...@gmail.com> on 2016/01/14 20:31:58 UTC

Mahout : 20-newsgroups Classification Example : Split command

Hi,

This request is in reference to the 20-newsgroups Classification Example at
the link below:
https://mahout.apache.org/users/classification/twenty-newsgroups.html

I am able to run the example and get the results mentioned at the link,
but when I try to run the example without the split command, the results
are not the same. Also, when I run other test data against the same model,
the results are not accurate.

Can this example be run without the split command?

Basically, this is what I am trying to do:

I took two separate datasets, one for training and one for testing.

I ran the commands below on both sets:
1. seqdirectory
2. seq2sparse

Now I have vectors generated for both datasets.
- Run the trainnb command on the first dataset's vector output. So instead
of training a model on 80% of the data, I am using the whole dataset.
- Run the testnb command on the second dataset's vector output. This is not
the 20% of the data; it's a completely new dataset, used solely for testing.

So instead of using mahout split, I have specified a separate dataset for
testing the model.

The results for this exercise are totally different from what I get when I
use the split command to split the data.


Thanks & Regards,

Alok R. Tanna

Re: Mahout : 20-newsgroups Classification Example : Split command

Posted by Andrew Palumbo <ap...@outlook.com>.

Correct. The example will not work with another dataset after the seq2sparse step. You'd have to write some vectorization methods that use the dictionary.file-0 file from the output of your seq2sparse run on the training data to vectorize your out-of-sample text. You could then run mahout testnb on that set.


The Scala example I mentioned earlier has an outline for writing such a method, although it only goes as far as a single document tokenized into unigrams.
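
As a rough illustration of the idea (not the Scala code from the linked page), here is a minimal Python sketch of vectorizing out-of-sample text against a fixed training dictionary. The names `vectorize` and `training_dictionary` are hypothetical, the dictionary is a small in-memory stand-in rather than the real seq2sparse output, and plain term counts are used in place of TF-IDF weights.

```python
from collections import Counter


def vectorize(text, dictionary):
    """Map text to a sparse {index: count} vector using a fixed dictionary.

    Terms that are not in the dictionary are dropped, because the trained
    model has no feature index for them.
    """
    counts = Counter(text.lower().split())
    return {dictionary[term]: n for term, n in counts.items() if term in dictionary}


# Hypothetical stand-in for the term -> index mapping produced when the
# training data was vectorized (assumption: loaded into a plain dict).
training_dictionary = {"the": 0, "engine": 1, "car": 2, "shuttle": 3, "orbit": 4}

# New text is indexed with the training dictionary, never with a new one.
vector = vectorize("The shuttle left orbit and the shuttle returned", training_dictionary)
print(sorted(vector.items()))
```

Whitespace tokenization into unigrams is the simplest possible choice here; a real pipeline would need to tokenize and weight new text the same way the training vectors were built.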





Re: Mahout : 20-newsgroups Classification Example : Split command

Posted by Alok Tanna <ta...@gmail.com>.

Thank you, Andrew, for your input. I will try the Scala example.

So this 20-newsgroups example cannot be used to test other datasets once
the split is done, is that right?

Thanks,
Alok Tanna


Re: Mahout : 20-newsgroups Classification Example : Split command

Posted by Andrew Palumbo <ap...@outlook.com>.

The poor results you are seeing in testing are because you've run seq2sparse on each set independently. This creates two different dictionaries, each of which serves as the vector index for the terms in its own vocabulary. You must vectorize your holdout set with the same dictionary that you trained your model on. There is an example of doing this in Scala, using the new Mahout Samsara environment, here:

http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html

See the "Define a function to tokenize and vectorize new text using our current dictionary" section.  
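
To make the mismatch concrete, here is a small Python sketch, separate from Mahout itself, that builds a dictionary independently for two toy document sets. The function `build_dictionary` and the first-appearance index order are assumptions for illustration only; the point is just that any two independently built dictionaries generally disagree on term indices.

```python
def build_dictionary(docs):
    """Assign each distinct term an integer index in order of first appearance.

    Assumption: this ordering is illustrative and is not claimed to match
    how seq2sparse numbers its terms.
    """
    dictionary = {}
    for doc in docs:
        for term in doc.lower().split():
            if term not in dictionary:
                dictionary[term] = len(dictionary)
    return dictionary


train_docs = ["the engine of the car", "the shuttle reached orbit"]
test_docs = ["orbit of the shuttle"]

# Built independently, as happens when each set is vectorized on its own.
train_dict = build_dictionary(train_docs)
test_dict = build_dictionary(test_docs)

# The same term lands at a different index in each dictionary, so a model
# trained on train_dict indices misreads vectors built with test_dict.
for term in ("orbit", "shuttle", "the"):
    print(term, train_dict[term], test_dict[term])
```

A model trained on the first dictionary would read the test vector's index 0 as "the", while the test dictionary used index 0 for "orbit".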

   
