Posted to user@mahout.apache.org by Adam Baron <ad...@gmail.com> on 2013/01/04 01:38:33 UTC

How to segment seq2sparse output into predefined training set and test set?

I went through the classify-20newsgroups.sh example and now want to use
Naïve Bayes to classify my own text corpus.  The only difference is that
I'd prefer to define which documents are in the training set and test set
myself, rather than using the split command.  My team prefers accuracy
comparisons between in-sample years and out-of-sample years to a random
selection across all years.  I don't believe I should run seq2sparse
separately for each set, since I'd end up with different document
frequencies (DFs) and, more concerning, different keys assigned to the
same n-gram in dictionary.file-0.

Is there an easy way to achieve this with pre-built Mahout functionality?
The only solution that comes to mind is to write a MapReduce program that
parses the tfidf-vectors after running seq2sparse and sorts the vectors
into the separate training set and test set based on some marker I put in
the vector name.
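
Just to make the question concrete, here's a rough sketch of the mapper I'm
picturing (untested; the class name, the 2010 cutoff, and the assumption that
the vector name starts with a four-digit year are all made up for
illustration):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.mahout.math.VectorWritable;

// Routes each tfidf vector to a "training" or "test" output directory
// based on the year at the front of the vector name.  Run as a map-only
// job over the tfidf-vectors SequenceFile.
public class YearSplitMapper
    extends Mapper<Text, VectorWritable, Text, VectorWritable> {

  private MultipleOutputs<Text, VectorWritable> out;

  @Override
  protected void setup(Context ctx) {
    out = new MultipleOutputs<Text, VectorWritable>(ctx);
  }

  @Override
  protected void map(Text name, VectorWritable vector, Context ctx)
      throws IOException, InterruptedException {
    // e.g. name = "2009/doc123"; in-sample years go to the training split.
    int year = Integer.parseInt(name.toString().substring(0, 4));
    String split = year <= 2010 ? "training" : "test";
    out.write(name, vector, split + "/part");
  }

  @Override
  protected void cleanup(Context ctx)
      throws IOException, InterruptedException {
    out.close();
  }
}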

Thanks,
        Adam

Re: How to segment seq2sparse output into predefined training set and test set?

Posted by Adam Baron <ad...@gmail.com>.
Thanks for the advice.  I tried out seq2encoded and that addressed my issue
of making the training set and test set use the same feature indices for
the same words.  However, I'm a little disappointed there is no dictionary
file produced by seq2encoded.  It would be nice to understand which word(s)
are associated with which feature index.

I hacked together some code to peek into the weightsPerLabelAndFeature
matrix of the NaiveBayesModel so I could understand what the top N n-grams
were for each label, kind of like the ClusterDumper.  If I use seq2sparse,
I can use its dictionary to see the actual n-grams behind each feature
index.  I don't think that's possible with seq2encoded.  It also doesn't
appear possible to use anything but 1-grams with seq2encoded; I'd prefer to
be able to include 2-grams and 3-grams as well.
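
In case it's useful, the hack looks roughly like this (args are the model
directory and dictionary.file-0; untested outside my environment, and labels
print as indices since the model itself doesn't store label names):

import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;

public class TopNGramsPerLabel {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    final NaiveBayesModel model =
        NaiveBayesModel.materialize(new Path(args[0]), conf);

    // Invert dictionary.file-0 (Text -> IntWritable) to map index -> n-gram.
    Map<Integer, String> terms = new HashMap<Integer, String>();
    Path dictPath = new Path(args[1]);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(dictPath.getFileSystem(conf), dictPath, conf);
    Text term = new Text();
    IntWritable index = new IntWritable();
    while (reader.next(term, index)) {
      terms.put(index.get(), term.toString());
    }
    reader.close();

    int topN = 10;
    for (int label = 0; label < model.numLabels(); label++) {
      final int l = label;
      // Rank all feature indices by their weight for this label, descending.
      Integer[] order = new Integer[model.numFeatures()];
      for (int f = 0; f < order.length; f++) {
        order[f] = f;
      }
      Arrays.sort(order, new Comparator<Integer>() {
        public int compare(Integer a, Integer b) {
          return Double.compare(model.weight(l, b), model.weight(l, a));
        }
      });
      System.out.println("label " + label + ":");
      for (int i = 0; i < topN && i < order.length; i++) {
        System.out.println("  " + terms.get(order[i]) + "\t"
            + model.weight(label, order[i]));
      }
    }
  }
}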

Has there been any discussion of accepting an existing dictionary file as a
parameter to seq2sparse?  If one were provided, the resultant TFIDF output
would reuse its n-gram feature indices.  That would make running subsequent
classifications against an already-trained Naïve Bayes model as consistent
as with seq2encoded, but with the added traceability of the dictionary file.

Thanks,
          Adam

Re: How to segment seq2sparse output into predefined training set and test set?

Posted by Robin Anil <ro...@gmail.com>.
Or use seq2encoded; it does randomized hashing instead of tfidf.  The
classification performance, as far as I have seen, is identical to
seq2sparse, and the model is much smaller (if you give it a lower
dimension to project onto).
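
The encoders underneath look roughly like this (a toy sketch, not the actual
seq2encoded code; the dimension and sample text are arbitrary):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedEncodingToy {
  public static void main(String[] args) {
    // Lower dimension -> smaller model, but more hash collisions.
    int dimension = 10000;
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
    Vector v = new RandomAccessSparseVector(dimension);
    for (String token : "the quick brown fox".split(" ")) {
      encoder.addToVector(token, v);  // hashes the token straight to indices
    }
    // No dictionary is produced: the word -> index mapping only exists
    // inside the hash function, which is why the index isn't reversible.
    System.out.println(v);
  }
}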

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.

Re: How to segment seq2sparse output into predefined training set and test set?

Posted by Dan Filimon <da...@gmail.com>.
I haven't actually done this myself, but look at
DatasetSplitter.java's MarkPreferenceMapper.
That class is responsible for the partitioning, and you can probably
just copy it and replace the map() so that it looks at the year in
the text somehow.

So, while it's not exactly code-free, it's better than writing a new program. :)
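
Untested, but the replacement could look something like this (a stand-in for
the copied class; it assumes the year is the first four characters of the
key, and the marker strings and year cutoff are placeholders):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.VectorWritable;

// Same shape as MarkPreferenceMapper, but marks records by year instead
// of by a random draw; a second filtering pass then splits on the marker.
public class MarkByYearMapper
    extends Mapper<Text, VectorWritable, Text, VectorWritable> {

  @Override
  protected void map(Text key, VectorWritable value, Context ctx)
      throws IOException, InterruptedException {
    int year = Integer.parseInt(key.toString().substring(0, 4));
    String marker = year <= 2010 ? "TRAINING" : "TEST";
    ctx.write(new Text(marker + "\t" + key.toString()), value);
  }
}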
