Posted to user@mahout.apache.org by Yuan Wang <yu...@gmail.com> on 2010/01/27 04:26:53 UTC

beginner question on classification: how to build the dataset in Java code

Hi all,

I am learning Mahout. It seems to me that most of the examples load the dataset
from files using the command line. I know the Bayes classifier can work with HBase.

Is there any way to build the dataset from scratch in Java Code?

For example, there is a User class with four attributes: ID (long or
String), age (int), weight (double), and diabetes (boolean).
There are 100 User objects in memory. Is there a way I can convert them
into any type of dataset that a classifier algorithm can handle?

I noticed there are a vector class and InMemoryDataStore, but I don't know how
to use them. If someone can give any hint or write down some pseudocode, that
would be very helpful.

Thanks,
Yuan

Re: beginner question on classification: how to build the dataset in Java code

Posted by Yuan Wang <yu...@gmail.com>.
Hi Robin,

Thanks for the quick response. I think I got the point. But there seems to be a
lot of I/O going on: writing the objects into files, and reading the files back
into memory. I know the current implementations focus on files as the storage.
It would be nice to have a unified model class (binary format or not)
across the algorithms (classification, clustering, and CF), with various
drivers that transfer data from files, XML, memory, or relational or
non-relational databases. That would make the framework more flexible.

I understand the project is still at an early stage and has other
priorities. But I think the dataset is quite fundamental to the framework.

Once again, thanks for your informative response. By the way, I am reading the
Manning book you and Sean Owen are working on, and I am looking forward to the
future chapters.

Yuan

On Tue, Jan 26, 2010 at 10:33 PM, Robin Anil <ro...@gmail.com> wrote:

> Hi Yuan, the Bayes classifier takes only binary features. So in order to turn
> your User class into a dataset, you need to create a tab-separated file with
> the label as the key and space-separated features as the value. The presence
> of a feature makes it true; its absence makes it false.
>
> E.g., if you are classifying heart-attack-prone vs. healthy individuals
> (assuming from your data), take two labels: heart-attack and healthy.
>
> You will need to convert integer and double values by mapping them to boolean
> features. Say you have boolean features like
>
> Weight:40-50
> Weight:50-60
>
> Age:20-30
> Age:30-40
>
> For user A with age = 23, weight = 53, diabetes = false,
> write the line
>
> healthy<TAB>Age:20-30 Weight:50-60
>
> For user B with age = 37, weight = 52, diabetes = true, write the line
>
> heart-attack<TAB>Age:30-40 Weight:50-60 diabetes
>
> Your dataset file will contain many such lines, one per user. Give
> the file path to the classifier and it learns the model for you.
>
> For now, the algorithm takes the data from a file, not from an in-memory
> data structure, and does not use vectors. Try the classification
> example (20newsgroups) to get an idea of how the classifier can be run.
>
> Robin

Re: beginner question on classification: how to build the dataset in Java code

Posted by Robin Anil <ro...@gmail.com>.
Hi Yuan, the Bayes classifier takes only binary features. So in order to turn
your User class into a dataset, you need to create a tab-separated file with
the label as the key and space-separated features as the value. The presence
of a feature makes it true; its absence makes it false.

E.g., if you are classifying heart-attack-prone vs. healthy individuals
(assuming from your data), take two labels: heart-attack and healthy.

You will need to convert integer and double values by mapping them to boolean
features. Say you have boolean features like

Weight:40-50
Weight:50-60

Age:20-30
Age:30-40
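
The range features above can be generated rather than written by hand. The
following is only a minimal Java sketch, assuming fixed-width buckets of 10
that include the lower bound and exclude the upper one. The class name
RangeFeatures and the method bucket are hypothetical and not part of Mahout.

```java
// Hypothetical helper for illustration; not a Mahout class.
class RangeFeatures {

    /**
     * Maps a numeric value to a boolean feature token such as "Age:20-30".
     * Assumption: buckets are half-open, [lower, lower + width), and width > 0.
     * The token contains no spaces, so it is safe in a space-separated feature list.
     */
    static String bucket(String name, double value, int width) {
        // Round down to the nearest multiple of width to find the bucket's lower bound.
        int lower = (int) Math.floor(value / width) * width;
        return name + ":" + lower + "-" + (lower + width);
    }
}
```

With these assumptions a boundary value such as age 30 falls into Age:30-40,
not Age:20-30. Whether width 10 is a sensible bucket size depends on the data.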

For user A with age = 23, weight = 53, diabetes = false,
write the line

healthy<TAB>Age:20-30 Weight:50-60

For user B with age = 37, weight = 52, diabetes = true, write the line

heart-attack<TAB>Age:30-40 Weight:50-60 diabetes
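
Lines like the two above could be produced from in-memory objects along these
lines. This is a rough Java sketch under stated assumptions, not Mahout code:
the User class is a stand-in for the one in the question, the names
BayesTrainingFile, bucket, toLine and write are hypothetical, the label for
each user is assumed to be known separately, buckets are assumed to be 10
wide, and the ID is left out of the features on the assumption that a unique
value per user is not useful to the classifier.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; none of these names come from Mahout.
class BayesTrainingFile {

    /** Stand-in for the User class described in the question. */
    static final class User {
        final long id;
        final int age;
        final double weight;
        final boolean diabetes;

        User(long id, int age, double weight, boolean diabetes) {
            this.id = id;
            this.age = age;
            this.weight = weight;
            this.diabetes = diabetes;
        }
    }

    /** Token for a fixed-width bucket; bucket("Age", 23, 10) gives "Age:20-30". */
    static String bucket(String name, double value, int width) {
        int lower = (int) Math.floor(value / width) * width;
        return name + ":" + lower + "-" + (lower + width);
    }

    /** Builds one line: the label, a tab, then the space-separated features that are present. */
    static String toLine(String label, User user) {
        List<String> features = new ArrayList<>();
        features.add(bucket("Age", user.age, 10));
        features.add(bucket("Weight", user.weight, 10));
        if (user.diabetes) {
            features.add("diabetes"); // a false value is expressed by omitting the token
        }
        return label + "\t" + String.join(" ", features);
    }

    /** Writes one line per labeled user, in the map's iteration order. */
    static void write(Path file, Map<User, String> labeled) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (Map.Entry<User, String> entry : labeled.entrySet()) {
                out.write(toLine(entry.getValue(), entry.getKey()));
                out.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Map<User, String> labeled = new LinkedHashMap<>();
        labeled.put(new User(1L, 23, 53.0, false), "healthy");
        labeled.put(new User(2L, 37, 52.0, true), "heart-attack");

        // Write to a temporary file, then echo it back to show the resulting lines.
        Path file = Files.createTempFile("users", ".tsv");
        write(file, labeled);
        Files.readAllLines(file, StandardCharsets.UTF_8).forEach(System.out::println);
    }
}
```

The path of the written file is what would then be handed to the classifier.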

Your dataset file will contain many such lines, one per user. Give
the file path to the classifier and it learns the model for you.

For now, the algorithm takes the data from a file, not from an in-memory
data structure, and does not use vectors. Try the classification
example (20newsgroups) to get an idea of how the classifier can be run.

Robin
