Posted to user@mahout.apache.org by "j.barrett Strausser" <j....@gmail.com> on 2014/02/03 22:01:50 UTC

Data(Set) creation for train and test.

Two part question.

1. String Descriptor for input data

Can anyone confirm my reasoning on the following?

I believe the code below does the following: it says that the first column
is the label (the value to be predicted) and that all other columns are to
be used in the tree construction, e.g. as variables to split on.

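import org.apache.mahout.classifier.df.data.DataLoader

// "L N N": column 0 is the label; columns 1 and 2 are numerical attributes.
val descriptor = "L N N"
val trainDataValues = fileAsStringArray("myTrainFile.csv")
// false = classification rather than regression
val dataset = DataLoader.generateDataset(descriptor, false, trainDataValues)
val data = DataLoader.loadData(dataset, trainDataValues)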

where "myTrainFile.csv" has a form like

"A", .45,.55
...
...
"B", 33.3, 22.3

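To make the reading of the descriptor concrete, here is a minimal sketch in plain Java that spells out the role of each token. It does not call Mahout; DescriptorSketch and describe are hypothetical names used only for illustration. It assumes the usual single-letter tokens: L for the label, N for a numerical attribute, C for a categorical attribute and I for an ignored column.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: names the role of each token in a descriptor such as "L N N". */
final class DescriptorSketch {

    /** Returns one human-readable role per column, in column order. */
    static List<String> describe(String descriptor) {
        List<String> roles = new ArrayList<>();
        for (String token : descriptor.trim().split("\\s+")) {
            switch (token) {
                case "L": roles.add("label"); break;
                case "N": roles.add("numerical"); break;
                case "C": roles.add("categorical"); break;
                case "I": roles.add("ignored"); break;
                default:
                    throw new IllegalArgumentException("Unknown descriptor token: " + token);
            }
        }
        return roles;
    }

    public static void main(String[] args) {
        // prints [label, numerical, numerical]
        System.out.println(describe("L N N"));
    }
}
```

Under that reading, "L N N" does match the description above: one label column followed by two numerical columns that the trees may split on.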


2. Test data without labels

I have now been given a new file, "myTestData.csv".

This data has no labels but is otherwise the same as above, so if I
attempt to create a dataset from it, an error is thrown complaining that
there is no label.

All I'm interested in is being able to call forest.classify(..., ...), but
I'm not sure how to correctly construct my test dataset.

I cannot simply split the original dataset as is done in most examples.


Any examples showing test data construction independent of the original
training set would be appreciated.


-- 


https://github.com/bearrito
@deepbearrito

Re: Data(Set) creation for train and test.

Posted by Frank Scholten <fr...@frankscholten.nl>.

Sorry, I didn't read your message properly. The random forest code is quite
different, and what I suggested does not apply.

The DataConverter converts a String to a Vector wrapped in an Instance. I
think you can create your training set with this.
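
One possible way to apply this to the unlabeled file, offered as an untested assumption rather than something confirmed in this thread: because the descriptor "L N N" declares a label column, each unlabeled line can be padded with a placeholder label so that it has the same number of columns as the training lines. The sketch below shows only that padding step in plain Java. PlaceholderLabelSketch and withPlaceholderLabel are hypothetical names, and it assumes the label is the first column and that the placeholder is a label value that also occurs in the training data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: pads unlabeled CSV rows so they match an "L N N" style descriptor. */
final class PlaceholderLabelSketch {

    /** Prepends the placeholder label and a comma to every (trimmed) row. */
    static List<String> withPlaceholderLabel(List<String> unlabeledRows, String placeholder) {
        List<String> padded = new ArrayList<>(unlabeledRows.size());
        for (String row : unlabeledRows) {
            padded.add(placeholder + "," + row.trim());
        }
        return padded;
    }

    public static void main(String[] args) {
        // Rows as they might appear in an unlabeled file; "A" is an assumed known label.
        List<String> testRows = Arrays.asList(".45,.55", "33.3, 22.3");
        for (String line : withPlaceholderLabel(testRows, "A")) {
            System.out.println(line);
        }
    }
}
```

Each padded line could then be handed to the DataConverter, and the resulting Instance to forest.classify(..., ...). Whether the placeholder label is really ignored during classification should be checked against the Mahout version in use.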



On Mon, Feb 3, 2014 at 10:09 PM, Frank Scholten <fr...@frankscholten.nl> wrote:

> Have a look at OnlineLogisticRegressionTest.iris().
>
> Here List.subList() is used in combination with Collections.shuffle() to
> make the train and test dataset split.
>
> So you could first read the dataset in a list and then use this trick.
>
> I just pushed an example to GitHub that also uses this approach, but I
> wrapped this logic into a utility.
>
> See: https://github.com/frankscholten/mahout-sgd-bank-marketing and
>
>
> https://github.com/frankscholten/mahout-sgd-bank-marketing/blob/master/src/main/java/bankmarketing/util/TrainAndTestSetUtil.java
>
> Cheers,
>
> Frank

Re: Data(Set) creation for train and test.

Posted by Frank Scholten <fr...@frankscholten.nl>.

Have a look at OnlineLogisticRegressionTest.iris().

Here List.subList() is used in combination with Collections.shuffle() to
make the train and test dataset split.

So you could first read the dataset in a list and then use this trick.
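
A minimal sketch of that trick in plain Java, written independently of the test and the utility referred to in this message: TrainTestSplitSketch, split, trainFraction and seed are hypothetical names chosen for illustration, and the fixed seed is only there to make the shuffle repeatable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only: shuffle-then-subList split of a list of rows. */
final class TrainTestSplitSketch {

    /** Shuffles a copy of the rows and returns two views of it: [train, test]. */
    static <T> List<List<T>> split(List<T> rows, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(rows);        // leave the caller's list untouched
        Collections.shuffle(shuffled, new Random(seed)); // random but repeatable order
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(shuffled.subList(0, cut));               // training rows
        parts.add(shuffled.subList(cut, shuffled.size())); // test rows
        return parts;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add("row" + i);
        }
        List<List<String>> parts = split(rows, 0.8, 42L);
        System.out.println("train size: " + parts.get(0).size());
        System.out.println("test size: " + parts.get(1).size());
    }
}
```

With ten rows and a fraction of 0.8 this gives eight training rows and two test rows, and no row lands in both parts.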

I just pushed an example to GitHub that also uses this approach, but I
wrapped this logic into a utility.

See: https://github.com/frankscholten/mahout-sgd-bank-marketing and

https://github.com/frankscholten/mahout-sgd-bank-marketing/blob/master/src/main/java/bankmarketing/util/TrainAndTestSetUtil.java

Cheers,

Frank

