Posted to user@mahout.apache.org by Keith Thompson <kt...@binghamton.edu> on 2011/05/22 21:10:38 UTC

file input formats

If I have some numerical data (e.g., the data at
http://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data)
and want to run a Mahout classification algorithm on that data, what steps
do I need to take in order to put the data into the correct input format?  I
have read that most everything requires a sequence file but I'm not sure
that I still understand what that is.  Do I need to provide a key for each
row in this dataset (and the rest of the row sans the final column would be
the value)?

Re: file input formats

Posted by Keith Thompson <kt...@binghamton.edu>.
Thanks. I appreciate it.  I think I need to buy a copy of that!

On Sun, May 22, 2011 at 8:21 PM, Ted Dunning <te...@gmail.com> wrote:

> It is blowing my own horn to some extent, but take a look at the Mahout in
> Action book.
>
> http://www.manning.com/owen/
>
> Also, there are several articles with examples for the Naive Bayesian
> classifiers.

Re: file input formats

Posted by Ted Dunning <te...@gmail.com>.
It is blowing my own horn to some extent, but take a look at the Mahout in
Action book.

http://www.manning.com/owen/

Also, there are several articles with examples for the Naive Bayesian
classifiers.

On Sun, May 22, 2011 at 2:08 PM, Keith Thompson <kt...@binghamton.edu> wrote:

> Hi Ted,
>
> Thanks for your help.  I have to learn Mahout on my own for a project I am
> doing.  I thought I would just "learn by doing" using readily available data
> sets to learn how the software works (even though the data set is small).
> Unfortunately, there doesn't seem to be any documentation that says for
> algorithm X, Mahout requires input in format Y.  The API seems helpful only
> if you already know that information.  If you know of any resources that
> document this type of thing, I would be grateful to know what they are.  Of
> course, maybe the fact that I am not a CS person doesn't help either :-)

Re: file input formats

Posted by Keith Thompson <kt...@binghamton.edu>.
Hi Ted,

Thanks for your help.  I have to learn Mahout on my own for a project I am
doing.  I thought I would just "learn by doing" using readily available data
sets to learn how the software works (even though the data set is small).
Unfortunately, there doesn't seem to be any documentation that says for
algorithm X, Mahout requires input in format Y.  The API seems helpful only
if you already know that information.  If you know of any resources that
document this type of thing, I would be grateful to know what they are.  Of
course, maybe the fact that I am not a CS person doesn't help either :-)


On Sun, May 22, 2011 at 4:43 PM, Ted Dunning <te...@gmail.com> wrote:

> First step is to decide what the data is.
>
> To me it looks like you have 33 columns with integer values in the range
> from 0 through 3.  The 34th column has integers up to 75.  The 35th column
> has integers in the range from 1 to 6.
>
> These values are either numbers or category codes.
>
> If you want to use the Naive Bayes algorithm, then they need to be category
> codes.  To process these, you need to convert each value into a "word".  My
> tendency would be to prefix the value with X12- where the 12 is the column
> number.  This makes it so values in one column are not confused with values
> in another.  For column 34, I would pick some cut points and encode that way
> (deciles or quartiles might be good).  Data can be in text form for the
> NaiveBayes categorizer.
>
> For the SGD categorizers, you need to code up a feature vector encoder.
> Look at the FeatureVectorEncoder and sub-classes for hints about this.  You
> will need 35 encoders, one for each column.  You can probably use a pretty
> small feature vector.
>
> This problem is very small, with only 366 data points.  As such, Mahout is
> probably not a particularly good choice for solving your problem.  Mahout is
> optimized for cases where the training data doesn't fit into memory and uses
> first-order methods.  With a small data-set like this, you can use all kinds
> of second-order methods to get potentially better results.

Re: file input formats

Posted by Ted Dunning <te...@gmail.com>.
First step is to decide what the data is.

To me it looks like you have 33 columns with integer values in the range
from 0 through 3.  The 34th column has integers up to 75.  The 35th column
has integers in the range from 1 to 6.
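
A quick way to check a description like this is to compute the minimum and maximum of every column directly. The sketch below is illustrative only: the function name `column_ranges` and the three sample rows are made up, and it assumes comma-separated rows of equal length in which any cell that is not an integer (for example a `?` placeholder) should be counted rather than parsed.

```python
import csv


def column_ranges(lines):
    """Return one (min, max, non_numeric_count) tuple per column.

    Assumes every row has the same number of comma-separated cells.
    """
    stats = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if not stats:
            stats = [[None, None, 0] for _ in row]
        for i, cell in enumerate(row):
            try:
                value = int(cell)
            except ValueError:
                stats[i][2] += 1  # count cells that are not integers
                continue
            low, high, _ = stats[i]
            stats[i][0] = value if low is None else min(low, value)
            stats[i][1] = value if high is None else max(high, value)
    return [tuple(s) for s in stats]


# Made-up rows with 4 columns instead of 35, purely for illustration.
sample = ["2,0,55,2", "3,1,?,1", "0,3,26,6"]
for number, (low, high, bad) in enumerate(column_ranges(sample), start=1):
    print(f"column {number}: min={low} max={high} non-numeric={bad}")
```

To run it on the real file, pass an open file object instead of `sample`; a column whose range is small is a candidate for treatment as category codes.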

These values are either numbers or category codes.

If you want to use the Naive Bayes algorithm, then they need to be category
codes.  To process these, you need to convert each value into a "word".  My
tendency would be to prefix the value with X12- where the 12 is the column
number.  This makes it so values in one column are not confused with values
in another.  For column 34, I would pick some cut points and encode that way
(deciles or quartiles might be good).  Data can be in text form for the
NaiveBayes categorizer.
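
The encoding described above can be sketched in a few lines. This is not Mahout code: the helper names `make_cut_points` and `row_to_words`, the `-bin` and `-missing` suffixes, and the sample values are all assumptions made for illustration, and the exact text layout that a given Mahout release expects is not specified in this thread.

```python
import bisect


def make_cut_points(values, n_bins=4):
    """Pick n_bins - 1 cut points at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def row_to_words(row, cut_points, binned_col=34, label_col=35):
    """Turn one row of strings into (label, list of column-prefixed words)."""
    label = None
    words = []
    for col, cell in enumerate(row, start=1):
        if col == label_col:
            label = cell  # the class to predict, kept apart from the words
        elif col == binned_col:
            if cell.isdigit():
                # Replace the raw number by the index of the bin it falls in.
                bin_index = bisect.bisect_right(cut_points, int(cell))
                words.append(f"X{col}-bin{bin_index}")
            else:
                words.append(f"X{col}-missing")
        else:
            # The prefix keeps a 2 in column 1 distinct from a 2 in column 7.
            words.append(f"X{col}-{cell}")
    return label, words


# Made-up example: 4 columns, column 3 is numeric, column 4 is the label.
cuts = make_cut_points([8, 26, 40, 55, 75, 33, 19, 62])  # quartile-style cuts
label, words = row_to_words(["2", "0", "55", "1"], cuts, binned_col=3, label_col=4)
print(label, " ".join(words))
```

For the real data the defaults (`binned_col=34`, `label_col=35`) match the column layout described above, and the cut points would be computed from all values in column 34.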

For the SGD categorizers, you need to code up a feature vector encoder.
Look at the FeatureVectorEncoder and sub-classes for hints about this.  You
will need 35 encoders, one for each column.  You can probably use a pretty
small feature vector.
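
The general idea behind such encoders is feature hashing: each feature name is hashed to a slot in a fixed-size vector. The sketch below shows that idea in plain Python and is not Mahout's API. The names `hashed_index` and `encode_row`, the use of MD5, and the vector size are assumptions chosen for illustration, and the code assumes every predictor cell is numeric.

```python
import hashlib


def hashed_index(name, size):
    """Map a feature name to a stable slot in [0, size)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size


def encode_row(row, size=20, numeric_cols=(34,), label_col=35):
    """Encode one row of strings as a fixed-length list of floats."""
    vector = [0.0] * size
    for col, cell in enumerate(row, start=1):
        if col == label_col:
            continue  # the target is not part of the feature vector
        if col in numeric_cols:
            # Continuous column: one slot per column, weighted by the value.
            vector[hashed_index(f"X{col}", size)] += float(cell)
        else:
            # Categorical column: one slot per (column, value) pair, weight 1.
            vector[hashed_index(f"X{col}-{cell}", size)] += 1.0
    return vector


# Made-up example: 4 columns, column 3 is numeric, column 4 is the label.
vector = encode_row(["2", "0", "55", "1"], size=16, numeric_cols=(3,), label_col=4)
print(len(vector), sum(vector))
```

Different features can collide in the same slot; with few columns and few distinct values per column, a small vector keeps collisions rare, which is one reason a small feature vector is probably enough here.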

This problem is very small, with only 366 data points.  As such, Mahout is
probably not a particularly good choice for solving your problem.  Mahout is
optimized for cases where the training data doesn't fit into memory and uses
first-order methods.  With a small data-set like this, you can use all kinds
of second-order methods to get potentially better results.
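
To make the first-order versus second-order distinction concrete, here is a toy comparison on a one-dimensional quadratic. It is only an illustration: the function f(x) = 2(x - 3)^2, the learning rate, and the step counts are arbitrary choices, and neither routine is taken from Mahout.

```python
def grad(x):
    """First derivative of f(x) = 2 * (x - 3) ** 2."""
    return 4.0 * (x - 3.0)


def hess(x):
    """Second derivative of f; constant because f is quadratic."""
    return 4.0


def gradient_descent(x, learning_rate=0.1, steps=10):
    """First-order method: uses only the gradient."""
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


def newton(x, steps=1):
    """Second-order method: rescales the gradient by the curvature."""
    for _ in range(steps):
        x -= grad(x) / hess(x)
    return x


print("gradient descent after 10 steps:", gradient_descent(0.0))
print("Newton after 1 step:", newton(0.0))
```

On this quadratic a single Newton step lands on the minimum at x = 3, while gradient descent is still approaching it after ten steps. The price is that second-order methods need curvature information over the whole data set, which is practical when the data fits in memory, as it does with 366 rows.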



On Sun, May 22, 2011 at 12:10 PM, Keith Thompson <kt...@binghamton.edu> wrote:

> If I have some numerical data (e.g., the data at
> http://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data)
> and want to run a Mahout classification algorithm on that data, what steps
> do I need to take in order to put the data into the correct input format?  I
> have read that most everything requires a sequence file but I'm not sure
> that I still understand what that is.  Do I need to provide a key for each
> row in this dataset (and the rest of the row sans the final column would be
> the value)?