
Posted to user@mahout.apache.org by Ranjitha Chandrashekar <Ra...@hcl.com> on 2013/01/17 12:15:41 UTC

Issue with Partial Implementation Problem

Hi

I am using Partial Implementation for Random Forest classification.

I have a training dataset with labels class 0, class 1, class 2.  The decision forest is built on this training dataset.  The classification for the test dataset is computed using the same data descriptor generated for the training dataset.  I am able to generate the confusion matrix and accuracy details when the test dataset includes the class variable.

However, I also need to classify data in a scenario where the test data may not have the class variable, or the class values are not known.  For example, assume the test data consists of future data points, for which class values can only be computed in the future.


* How is it possible to classify a test dataset where the class label is not defined or not known? I have tried using default labels like "unknown" and "NO_LABEL". It doesn't seem to work.


* How do I set the class label to "unknown" in the testing dataset?

Looking forward to your reply,

Thanks
Ranjitha.
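
One possible reason placeholder labels such as "unknown" are rejected, consistent with the "Label value (1) not known" error quoted later in this thread, is that every label token in the test file has to be one of the labels recorded in the dataset descriptor. The Python sketch below is a pre-flight check under that assumption. The function name find_unknown_labels, the hard-coded label list, and the assumption that the label is the last comma-separated field are illustrative and are not part of Mahout.

```python
from typing import Iterable, List, Tuple


def find_unknown_labels(
    records: Iterable[str], known_labels: Iterable[str]
) -> List[Tuple[int, str]]:
    """Return (line_number, label_token) for records whose label is not known.

    Assumes the label is the last comma-separated field of each record.
    """
    known = set(known_labels)
    problems = []
    for line_number, record in enumerate(records, start=1):
        record = record.strip()
        if not record:
            continue  # skip blank lines
        label = record.rsplit(",", 1)[-1].strip()
        if label not in known:
            problems.append((line_number, label))
    return problems


if __name__ == "__main__":
    # Labels as they appear in the log line "dataset.labels: [normal, anomaly]".
    labels = ["normal", "anomaly"]
    sample = [
        "13,tcp,telnet,SF,118,normal",   # shortened, hypothetical record
        "13,tcp,telnet,SF,118,1",        # numeric placeholder
        "13,tcp,telnet,SF,118,unknown",  # string placeholder
    ]
    for line_number, label in find_unknown_labels(sample, labels):
        print(f"line {line_number}: label {label!r} is not in {labels}")
```

Running such a check before TestForest would flag the offending records up front instead of failing on the first one.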



::DISCLAIMER::
----------------------------------------------------------------------------------------------------------------------------------------------------

The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only.
E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted,
lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e mail and its contents
(with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates.
Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the
views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification,
distribution and / or publication of this message without the prior written consent of authorized representative of
HCL is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately.
Before opening any email and/or attachments, please check them for viruses and other defects.

----------------------------------------------------------------------------------------------------------------------------------------------------

Re: Issue with Partial Implementation Problem

Posted by deneche abdelhakim <ad...@gmail.com>.

Glad it worked.


On Tue, Jan 22, 2013 at 7:20 AM, Ranjitha Chandrashekar <Ranjitha.Ch@hcl.com> wrote:

> Hi Deneche,
>
> The patch works perfectly! The classifier now outputs the label string
> values instead of the numerical codes.
>
> Thank you for fixing the issue.
>
> Regards
> Ranjitha
>
> -----Original Message-----
> From: deneche abdelhakim [mailto:adeneche@gmail.com]
> Sent: 18 January 2013 21:14
> To: user@mahout.apache.org
> Subject: Re: Issue with Partial Implementation Problem
>
> I submitted a patch; you can give it a try and let me know if it fixes the
> problem.
>
> https://issues.apache.org/jira/browse/MAHOUT-1143
>
> The classifier should now output the label string values instead of the
> numerical codes.
>
>
> On Fri, Jan 18, 2013 at 4:26 PM, deneche abdelhakim <adeneche@gmail.com> wrote:
>
> > Hi Ranjitha,
> >
> > I created a JIRA issue to fix this, and should submit a patch soon.
> >
> >
> > On Fri, Jan 18, 2013 at 10:29 AM, Ranjitha Chandrashekar <
> > Ranjitha.Ch@hcl.com> wrote:
> >
> >> Hi Deneche,
> >>
> >> Thanks. As suggested, I replaced the label value with "normal" in the
> >> KDDTest dataset and tested the forest without the -a option.
> >> It generates a binary file (.out file) with values 0 and 1.
> >>
> >> To interpret this, I went through the code. As I understand it, the MR
> >> job (Classifier.CMapper) generates a file with Key -> Correct Label and
> >> Value -> Prediction. It then creates a new file with the .out extension
> >> that contains only the Values, i.e. the Predictions (0 or 1 in my case),
> >> and deletes the file generated by the MR job. Hence I do not have access
> >> to the MR job's file, which contains the Correct Label and Prediction
> >> for each input test record.
> >>
> >> Looking at these predictions, I am not sure what 0 and 1 actually mean.
> >> Does 1 mean the record is classified correctly, i.e. "normal" in this
> >> case, and 0 mean the classification is wrong and should be "anomaly"?
> >>
> >> Please Suggest
> >>
> >> Regards
> >> Ranjitha
> >>
> >> -----Original Message-----
> >> From: deneche abdelhakim [mailto:adeneche@gmail.com]
> >> Sent: 18 January 2013 12:21
> >> To: user@mahout.apache.org
> >> Subject: Re: Issue with Partial Implementation Problem
> >>
> >> My mistake. You should put any label value available in the training set.
> >> In the previous example, putting "normal" in all test records should be
> >> fine.
> >>
> >>
> >> On Fri, Jan 18, 2013 at 7:26 AM, Ranjitha Chandrashekar <
> >> Ranjitha.Ch@hcl.com> wrote:
> >>
> >> > Hi Deneche
> >> >
> >> > Thank you for your quick response.
> >> >
> >> > I tried using the numerical value in the label attribute in the test
> >> > data.
> >> >
> >> > Original Record in KDDTest :
> >> >
> >> > 13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,normal
> >> >
> >> > Replaced Record :
> >> >
> >> > 13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,1
> >> >
> >> > (normal class replaced with numerical value 1)
> >> >
> >> > Ran TestForest on the KDDTest dataset. Following is the error that I get.
> >> > Sequential and MapReduce classification give the same error.
> >> >
> >> > Command --> hadoop jar
> >> > /usr/lib/mahout-0.5/mahout-examples-0.5-cdh3u5-job.jar
> >> > org.apache.mahout.df.mapreduce.TestForest -i
> >> > /user/ranjitha/input/KDDTest+.arff.txt_withnum -ds
> >> > /user/ranjitha/input/KDDTrain+.info -m /user/ranjitha/KDDForest -o
> >> > /user/ranjitha/KDDResult
> >> >
> >> > 13/01/18 11:29:24 INFO mapreduce.TestForest: Loading the forest...
> >> > 13/01/18 11:29:24 INFO mapreduce.TestForest: Sequential classification...
> >> > 13/01/18 11:29:24 ERROR data.DataConverter: label token: 1 dataset.labels: [normal, anomaly]
> >> > Exception in thread "main" java.lang.IllegalStateException: Label value (1) not known
> >> >         at org.apache.mahout.df.data.DataConverter.convert(DataConverter.java:71)
> >> >         at org.apache.mahout.df.mapreduce.TestForest.testFile(TestForest.java:256)
> >> >         at org.apache.mahout.df.mapreduce.TestForest.sequential(TestForest.java:216)
> >> >         at org.apache.mahout.df.mapreduce.TestForest.testForest(TestForest.java:172)
> >> >         at org.apache.mahout.df.mapreduce.TestForest.run(TestForest.java:142)
> >> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> >         at org.apache.mahout.df.mapreduce.TestForest.main(TestForest.java:275)
> >> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >         at java.lang.reflect.Method.invoke(Method.java:616)
> >> >         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >> >
> >> > Looking forward to your reply
> >> >
> >> > Thanks
> >> > Ranjitha.
> >> >
> >> > -----Original Message-----
> >> > From: deneche abdelhakim [mailto:adeneche@gmail.com]
> >> > Sent: 17 January 2013 18:20
> >> > To: user@mahout.apache.org
> >> > Subject: Re: Issue with Partial Implementation Problem
> >> >
> >> > Hi Ranjitha,
> >> >
> >> > just put any numerical value in the label attribute. You should be able
> >> > to classify the data, but you won't be able to compute the confusion
> >> > matrix or the accuracy.
> >> >
> >> >
> >> > On Thu, Jan 17, 2013 at 12:15 PM, Ranjitha Chandrashekar <
> >> > Ranjitha.Ch@hcl.com> wrote:
> >> >
> >> > > Hi
> >> > >
> >> > > I am using Partial Implementation for Random Forest classification.
> >> > >
> >> > > I have a training dataset with labels class 0, class 1, class 2.  The
> >> > > decision forest is built on this training dataset.  The classification
> >> > > for the test dataset is computed using the same data descriptor
> >> > > generated for the training dataset.  I am able to generate the
> >> > > confusion matrix and accuracy details when the test dataset includes
> >> > > the class variable.
> >> > >
> >> > > However, I also need to classify data in a scenario where the test
> >> > > data may not have the class variable, or the class values are not
> >> > > known.  For example, assume the test data consists of future data
> >> > > points, for which class values can only be computed in the future.
> >> > >
> >> > >
> >> > > * How is it possible to classify a test dataset where the class label
> >> > > is not defined or not known? I have tried using default labels like
> >> > > "unknown" and "NO_LABEL". It doesn't seem to work.
> >> > >
> >> > >
> >> > > * How do I set the class label to "unknown" in the testing dataset?
> >> > >
> >> > > Looking forward to your reply,
> >> > >
> >> > > Thanks
> >> > > Ranjitha.
> >
> >
>
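
The workaround settled on in the thread above (put a label that exists in the training set, such as "normal", in every test record) can be sketched in Python as below. This is a minimal sketch and not part of Mahout: the function name with_placeholder_label and the has_label_column flag are hypothetical, and it assumes comma-separated records whose label, when present, is the last field.

```python
from typing import Iterable, Iterator


def with_placeholder_label(
    records: Iterable[str], placeholder: str, has_label_column: bool
) -> Iterator[str]:
    """Yield records whose last field is `placeholder`.

    If `has_label_column` is True the existing last field is overwritten,
    otherwise the placeholder is appended as a new last field.
    """
    for record in records:
        record = record.strip()
        if not record:
            continue  # skip blank lines
        if has_label_column:
            # Drop the existing last field (the old label).
            features = record.rsplit(",", 1)[0]
        else:
            features = record
        yield f"{features},{placeholder}"


if __name__ == "__main__":
    # Shortened, hypothetical unlabeled records.
    unlabeled = ["13,tcp,telnet,SF,118,2425", "0,udp,private,SF,105,146"]
    for line in with_placeholder_label(unlabeled, "normal", has_label_column=False):
        print(line)
```

Because the placeholder carries no information about the true class, any accuracy or confusion matrix computed against it should be ignored; only the predictions themselves are of interest.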

RE: Issue with Partial Implementation Problem

Posted by Ranjitha Chandrashekar <Ra...@hcl.com>.

Hi Deneche,

The patch works perfectly! The classifier now outputs the label string values instead of the
numerical codes.

Thank you for fixing the issue.

Regards
Ranjitha


Re: Issue with Partial Implementation Problem

Posted by deneche abdelhakim <ad...@gmail.com>.

I submitted a patch; you can give it a try and let me know if it fixes the
problem.

https://issues.apache.org/jira/browse/MAHOUT-1143

The classifier should now output the label string values instead of the
numerical codes.
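
For a build without this patch, the numerical codes can be translated back by hand. The sketch below assumes, and this is an assumption rather than something stated in the thread, that each code is the zero-based position of the label in the descriptor's label list (logged in this thread as "dataset.labels: [normal, anomaly]"), so that 0 would mean "normal" and 1 would mean "anomaly". The function name decode_predictions is hypothetical, and reading the codes out of the .out file is left out because its exact encoding is not described here.

```python
from typing import Iterable, List, Sequence


def decode_predictions(codes: Iterable[int], labels: Sequence[str]) -> List[str]:
    """Map integer prediction codes to label strings by list position."""
    decoded = []
    for code in codes:
        if not 0 <= code < len(labels):
            raise ValueError(f"prediction code {code} has no matching label")
        decoded.append(labels[code])
    return decoded


if __name__ == "__main__":
    labels = ["normal", "anomaly"]  # order taken from the quoted log line
    print(decode_predictions([0, 1, 1, 0], labels))
```

Under this reading, a code says which class was predicted, not whether the prediction was correct.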



Re: Issue with Partial Implementation Problem

Posted by deneche abdelhakim <ad...@gmail.com>.

Hi Ranjitha,

I created a JIRA issue to fix this, and should submit a patch soon.


On Fri, Jan 18, 2013 at 10:29 AM, Ranjitha Chandrashekar <
Ranjitha.Ch@hcl.com> wrote:

> Hi Deneche,
>
> Thanks. As suggested, I replaced the label value as "normal" in KDDTest
> dataset and tested the forest without -a option.
> It generates a binary file(.out file) with values 0 and 1.
>
> In order to interpret this I have gone through the code and hence
> understand that MR job (Classifier.CMapper) generates a file with Key ->
> Correct Label and Value -> Prediction. Then it creates a new file with .out
> extension which only contains Values i.e. Prediction(0 or 1) in my case and
> then it deletes the previous file generated by the MR job. Hence I do not
> have access to the file generated by MR job which contains Correct Label
> and Prediction for each input Test record
>
> After looking at these predictions I am not sure what 0 and 1 actually
> means . Does 1 mean its classified correctly..? "normal" in this case and 0
> means the classification is wrong and should be "anamoly"?
>
> Please Suggest
>
> Regards
> Ranjitha
>
> -----Original Message-----
> From: deneche abdelhakim [mailto:adeneche@gmail.com]
> Sent: 18 January 2013 12:21
> To: user@mahout.apache.org
> Subject: Re: Issue with Partial Implementation Problem
>
> My mistake. You should put any label value available in the training set.
> In the previous example, putting "normal" in all test record should be
> fine.
>
>
> On Fri, Jan 18, 2013 at 7:26 AM, Ranjitha Chandrashekar <
> Ranjitha.Ch@hcl.com
> > wrote:
>
> > Hi Deneche
> >
> > Thank you for your quick response.
> >
> > I tried using the numerical value in the label attribute in the test
> data.
> >
> > Original Record in KDDTest :
> >
> 13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,normal
> >
> > Replaced Record :
> >
> >
> 13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,1
> >
> > (normal class replaced with numerical value 1)
> >
> > Ran TestForest on KDDTest dataset. Following is the error that i get.
> > Sequential and map reduce classification gives the same error.
> >
> > Command --> hadoop jar
> > /usr/lib/mahout-0.5/mahout-examples-0.5-cdh3u5-job.jar
> > org.apache.mahout.df.mapreduce.TestForest -i
> > /user/ranjitha/input/KDDTest+.arff.txt_withnum -ds
> > /user/ranjitha/input/KDDTrain+.info -m /user/ranjitha/KDDForest -o
> > /user/ranjitha/KDDResult
> >
> > 13/01/18 11:29:24 INFO mapreduce.TestForest: Loading the forest...
> > 13/01/18 11:29:24 INFO mapreduce.TestForest: Sequential classification...
> > 13/01/18 11:29:24 ERROR data.DataConverter: label token: 1
> dataset.labels:
> > [normal, anomaly] Exception in thread "main"
> > java.lang.IllegalStateException: Label value (1) not known
> >         at
> > org.apache.mahout.df.data.DataConverter.convert(DataConverter.java:71)
> >         at
> > org.apache.mahout.df.mapreduce.TestForest.testFile(TestForest.java:256)
> >         at
> > org.apache.mahout.df.mapreduce.TestForest.sequential(TestForest.java:216)
> >         at
> > org.apache.mahout.df.mapreduce.TestForest.testForest(TestForest.java:172)
> >         at
> > org.apache.mahout.df.mapreduce.TestForest.run(TestForest.java:142)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >         at
> > org.apache.mahout.df.mapreduce.TestForest.main(TestForest.java:275)
> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >         at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >         at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >         at java.lang.reflect.Method.invoke(Method.java:616)
> >         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >
> > Looking forward to your reply
> >
> > Thanks
> > Ranjitha.
> >
> > -----Original Message-----
> > From: deneche abdelhakim [mailto:adeneche@gmail.com]
> > Sent: 17 January 2013 18:20
> > To: user@mahout.apache.org
> > Subject: Re: Issue with Partial Implementation Problem
> >
> > Hi Ranjitha,
> >
> > just put any numerical value in the label attribute. You should be able to
> > classify the data, but you won't be able to compute the confusion matrix or
> > the accuracy.
> >
> >
> > On Thu, Jan 17, 2013 at 12:15 PM, Ranjitha Chandrashekar <
> > Ranjitha.Ch@hcl.com> wrote:
> >
> > > Hi
> > >
> > > I am using Partial Implementation for Random Forest classification.
> > >
> > > I have a training dataset with labels class0, class 1, class 2.  The
> > > decision forest is built on this training dataset.  The classification for
> > > the test dataset is computed using the same data descriptor generated for
> > > the training dataset.  I am able to generate confusion matrix, accuracy
> > > details with the test data set with class variable.
> > >
> > > However I also need to make a classification for a scenario, where test
> > > data may not have the class variable or class values are not known.  For
> > > ex, assume test data is about future data points, for which class values
> > > will have to be computed only in the future.
> > >
> > >
> > > *         How is it possible to classify the test data set, where the
> > > class label is not defined or not known. I have tried using default labels
> > > like "unknown", "NO_LABEL". It doesnt seem to work.
> > >
> > >
> > > *         How to set the class label as "unknown" in the testing dataset.
> > >
> > > Looking forward to your reply,
> > >
> > > Thanks
> > > Ranjitha.
> >
>

RE: Issue with Partial Implementation Problem

Posted by Ranjitha Chandrashekar <Ra...@hcl.com>.

Hi Deneche,

Thanks. As suggested, I replaced the label value with "normal" in the KDDTest dataset and tested the forest without the -a option.
It generates a binary file (.out file) with values 0 and 1.

To interpret this I went through the code. As I understand it, the MR job (Classifier.CMapper) generates a file with Key -> Correct Label and Value -> Prediction. It then creates a new file with the .out extension that contains only the Values, i.e. the Predictions (0 or 1 in my case), and deletes the file generated by the MR job. Hence I do not have access to the file generated by the MR job, which contains the Correct Label and Prediction for each input test record.

After looking at these predictions I am not sure what 0 and 1 actually mean. Does 1 mean the record is classified correctly, i.e. "normal" in this case, and 0 mean the classification is wrong and should be "anomaly"?

Please suggest.

Regards
Ranjitha
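
A note on reading the .out values: nothing in this thread confirms what 0 and 1 mean. One plausible reading is that each value is an index into the dataset's label list, which the error log quoted in this thread prints as [normal, anomaly]. Under that assumption 0 would stand for "normal" and 1 for "anomaly", regardless of which placeholder label was put in the test records. The Java sketch below only illustrates that mapping. The class name PredictionDecoder, the label order, and the one-prediction-per-line text layout are assumptions for illustration, not Mahout API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout.
final class PredictionDecoder {

    // Assumption: same order as "dataset.labels: [normal, anomaly]" in the error log.
    static final List<String> LABELS = Arrays.asList("normal", "anomaly");

    // Map one predicted index to its label name; out-of-range values are flagged.
    static String decode(int prediction, List<String> labels) {
        if (prediction < 0 || prediction >= labels.size()) {
            return "unknown(" + prediction + ")";
        }
        return labels.get(prediction);
    }

    // Decode a list of text lines, one prediction per line (assumed layout).
    static List<String> decodeAll(List<String> lines, List<String> labels) {
        List<String> decoded = new ArrayList<>();
        for (String line : lines) {
            String token = line.trim();
            if (token.isEmpty()) {
                continue; // skip blank lines
            }
            // Accept both "1" and "1.0" style values.
            int code = (int) Double.parseDouble(token);
            decoded.add(decode(code, labels));
        }
        return decoded;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("0", "1", "1");
        for (String label : decodeAll(sample, LABELS)) {
            System.out.println(label);
        }
    }
}
```

If this reading holds, the values say nothing about whether a prediction is correct; they only name the predicted class. Checking the label order in the .info descriptor would be the way to confirm it.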

-----Original Message-----
From: deneche abdelhakim [mailto:adeneche@gmail.com] 
Sent: 18 January 2013 12:21
To: user@mahout.apache.org
Subject: Re: Issue with Partial Implementation Problem

My mistake. You should put any label value available in the training set.
In the previous example, putting "normal" in all test record should be fine.


On Fri, Jan 18, 2013 at 7:26 AM, Ranjitha Chandrashekar <Ranjitha.Ch@hcl.com
> wrote:

> Hi Deneche
>
> Thank you for your quick response.
>
> I tried using the numerical value in the label attribute in the test data.
>
> Original Record in KDDTest :
> 13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,normal
>
> Replaced Record :
>
> 13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,1
>
> (normal class replaced with numerical value 1)
>
> Ran TestForest on KDDTest dataset. Following is the error that i get.
> Sequential and map reduce classification gives the same error.
>
> Command --> hadoop jar
> /usr/lib/mahout-0.5/mahout-examples-0.5-cdh3u5-job.jar
> org.apache.mahout.df.mapreduce.TestForest -i
> /user/ranjitha/input/KDDTest+.arff.txt_withnum -ds
> /user/ranjitha/input/KDDTrain+.info -m /user/ranjitha/KDDForest -o
> /user/ranjitha/KDDResult
>
> 13/01/18 11:29:24 INFO mapreduce.TestForest: Loading the forest...
> 13/01/18 11:29:24 INFO mapreduce.TestForest: Sequential classification...
> 13/01/18 11:29:24 ERROR data.DataConverter: label token: 1 dataset.labels: [normal, anomaly]
> Exception in thread "main" java.lang.IllegalStateException: Label value (1) not known
>         at org.apache.mahout.df.data.DataConverter.convert(DataConverter.java:71)
>         at org.apache.mahout.df.mapreduce.TestForest.testFile(TestForest.java:256)
>         at org.apache.mahout.df.mapreduce.TestForest.sequential(TestForest.java:216)
>         at org.apache.mahout.df.mapreduce.TestForest.testForest(TestForest.java:172)
>         at org.apache.mahout.df.mapreduce.TestForest.run(TestForest.java:142)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.mahout.df.mapreduce.TestForest.main(TestForest.java:275)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:616)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> Looking forward to your reply
>
> Thanks
> Ranjitha.
>
> -----Original Message-----
> From: deneche abdelhakim [mailto:adeneche@gmail.com]
> Sent: 17 January 2013 18:20
> To: user@mahout.apache.org
> Subject: Re: Issue with Partial Implementation Problem
>
> Hi Ranjitha,
>
> just put any numerical value in the label attribute. You should be able to
> classify the data, but you won't be able to compute the confusion matrix or
> the accuracy.
>
>
> On Thu, Jan 17, 2013 at 12:15 PM, Ranjitha Chandrashekar <
> Ranjitha.Ch@hcl.com> wrote:
>
> > Hi
> >
> > I am using Partial Implementation for Random Forest classification.
> >
> > I have a training dataset with labels class0, class 1, class 2.  The
> > decision forest is built on this training dataset.  The classification for
> > the test dataset is computed using the same data descriptor generated for
> > the training dataset.  I am able to generate confusion matrix, accuracy
> > details with the test data set with class variable.
> >
> > However I also need to make a classification for a scenario, where test
> > data may not have the class variable or class values are not known.  For
> > ex, assume test data is about future data points, for which class values
> > will have to be computed only in the future.
> >
> >
> > *         How is it possible to classify the test data set, where the
> > class label is not defined or not known. I have tried using default labels
> > like "unknown", "NO_LABEL". It doesnt seem to work.
> >
> >
> > *         How to set the class label as "unknown" in the testing dataset.
> >
> > Looking forward to your reply,
> >
> > Thanks
> > Ranjitha.
>

Re: Issue with Partial Implementation Problem

Posted by deneche abdelhakim <ad...@gmail.com>.

My mistake. You should put any label value available in the training set.
In the previous example, putting "normal" in all test records should be fine.
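
The advice above can be sketched as a small preprocessing step: overwrite the last field of every test record with a label that exists in the training set. This is a minimal Java sketch, assuming the label is the last comma-separated field (as in the KDD records quoted in this thread) and that the file is a local copy rather than a path on HDFS. The class name PlaceholderLabel and its command-line arguments are hypothetical, not part of Mahout.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Mahout.
final class PlaceholderLabel {

    // Replace the last comma-separated field (assumed to be the label)
    // with a label value that is known from the training set.
    static String withPlaceholder(String record, String label) {
        int lastComma = record.lastIndexOf(',');
        if (lastComma < 0) {
            throw new IllegalArgumentException("no comma in record: " + record);
        }
        return record.substring(0, lastComma + 1) + label;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: PlaceholderLabel <in.csv> <out.csv> <label>");
            return;
        }
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Blank lines are copied through unchanged.
                out.write(line.trim().isEmpty() ? line : withPlaceholder(line, args[2]));
                out.newLine();
            }
        }
    }
}
```

Because the placeholder is not a real label, any confusion matrix or accuracy computed against it would be meaningless; only the predictions themselves are usable.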


On Fri, Jan 18, 2013 at 7:26 AM, Ranjitha Chandrashekar <Ranjitha.Ch@hcl.com
> wrote:

> Hi Deneche
>
> Thank you for your quick response.
>
> I tried using the numerical value in the label attribute in the test data.
>
> Original Record in KDDTest :
> 13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,normal
>
> Replaced Record :
>
> 13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,1
>
> (normal class replaced with numerical value 1)
>
> Ran TestForest on KDDTest dataset. Following is the error that i get.
> Sequential and map reduce classification gives the same error.
>
> Command --> hadoop jar
> /usr/lib/mahout-0.5/mahout-examples-0.5-cdh3u5-job.jar
> org.apache.mahout.df.mapreduce.TestForest -i
> /user/ranjitha/input/KDDTest+.arff.txt_withnum -ds
> /user/ranjitha/input/KDDTrain+.info -m /user/ranjitha/KDDForest -o
> /user/ranjitha/KDDResult
>
> 13/01/18 11:29:24 INFO mapreduce.TestForest: Loading the forest...
> 13/01/18 11:29:24 INFO mapreduce.TestForest: Sequential classification...
> 13/01/18 11:29:24 ERROR data.DataConverter: label token: 1 dataset.labels: [normal, anomaly]
> Exception in thread "main" java.lang.IllegalStateException: Label value (1) not known
>         at org.apache.mahout.df.data.DataConverter.convert(DataConverter.java:71)
>         at org.apache.mahout.df.mapreduce.TestForest.testFile(TestForest.java:256)
>         at org.apache.mahout.df.mapreduce.TestForest.sequential(TestForest.java:216)
>         at org.apache.mahout.df.mapreduce.TestForest.testForest(TestForest.java:172)
>         at org.apache.mahout.df.mapreduce.TestForest.run(TestForest.java:142)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.mahout.df.mapreduce.TestForest.main(TestForest.java:275)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:616)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> Looking forward to your reply
>
> Thanks
> Ranjitha.
>
> -----Original Message-----
> From: deneche abdelhakim [mailto:adeneche@gmail.com]
> Sent: 17 January 2013 18:20
> To: user@mahout.apache.org
> Subject: Re: Issue with Partial Implementation Problem
>
> Hi Ranjitha,
>
> just put any numerical value in the label attribute. You should be able to
> classify the data, but you won't be able to compute the confusion matrix or
> the accuracy.
>
>
> On Thu, Jan 17, 2013 at 12:15 PM, Ranjitha Chandrashekar <
> Ranjitha.Ch@hcl.com> wrote:
>
> > Hi
> >
> > I am using Partial Implementation for Random Forest classification.
> >
> > I have a training dataset with labels class0, class 1, class 2.  The
> > decision forest is built on this training dataset.  The classification for
> > the test dataset is computed using the same data descriptor generated for
> > the training dataset.  I am able to generate confusion matrix, accuracy
> > details with the test data set with class variable.
> >
> > However I also need to make a classification for a scenario, where test
> > data may not have the class variable or class values are not known.  For
> > ex, assume test data is about future data points, for which class values
> > will have to be computed only in the future.
> >
> >
> > *         How is it possible to classify the test data set, where the
> > class label is not defined or not known. I have tried using default labels
> > like "unknown", "NO_LABEL". It doesnt seem to work.
> >
> >
> > *         How to set the class label as "unknown" in the testing dataset.
> >
> > Looking forward to your reply,
> >
> > Thanks
> > Ranjitha.
>

RE: Issue with Partial Implementation Problem

Posted by Ranjitha Chandrashekar <Ra...@hcl.com>.

Hi Deneche

Thank you for your quick response.

I tried using the numerical value in the label attribute in the test data.

Original Record in KDDTest : 13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,normal

Replaced Record :
13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,1

(normal class replaced with numerical value 1)

Ran TestForest on the KDDTest dataset. Following is the error that I get. Sequential and map reduce classification give the same error.

Command --> hadoop jar /usr/lib/mahout-0.5/mahout-examples-0.5-cdh3u5-job.jar org.apache.mahout.df.mapreduce.TestForest -i /user/ranjitha/input/KDDTest+.arff.txt_withnum -ds /user/ranjitha/input/KDDTrain+.info -m /user/ranjitha/KDDForest -o /user/ranjitha/KDDResult

13/01/18 11:29:24 INFO mapreduce.TestForest: Loading the forest...
13/01/18 11:29:24 INFO mapreduce.TestForest: Sequential classification...
13/01/18 11:29:24 ERROR data.DataConverter: label token: 1 dataset.labels: [normal, anomaly]
Exception in thread "main" java.lang.IllegalStateException: Label value (1) not known
        at org.apache.mahout.df.data.DataConverter.convert(DataConverter.java:71)
        at org.apache.mahout.df.mapreduce.TestForest.testFile(TestForest.java:256)
        at org.apache.mahout.df.mapreduce.TestForest.sequential(TestForest.java:216)
        at org.apache.mahout.df.mapreduce.TestForest.testForest(TestForest.java:172)
        at org.apache.mahout.df.mapreduce.TestForest.run(TestForest.java:142)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.df.mapreduce.TestForest.main(TestForest.java:275)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
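
The log above shows the label token "1" being rejected because it is not in dataset.labels ([normal, anomaly]). A pre-check of the test records can catch this before the job is launched. The following is a minimal Java sketch, assuming the label is the last comma-separated field of each record; the class LabelCheck and its method are hypothetical and do not reproduce Mahout's DataConverter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-check, not part of Mahout.
final class LabelCheck {

    // Return the 1-based positions of records whose last field
    // (assumed to be the label) is not among the known labels.
    static List<Integer> unknownLabelLines(List<String> records, Set<String> knownLabels) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            String record = records.get(i).trim();
            if (record.isEmpty()) {
                continue; // ignore blank lines
            }
            String label = record.substring(record.lastIndexOf(',') + 1).trim();
            if (!knownLabels.contains(label)) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Labels taken from the error log; shortened records for illustration only.
        Set<String> known = new HashSet<>(Arrays.asList("normal", "anomaly"));
        List<String> records = Arrays.asList("13,tcp,telnet,SF,normal", "13,tcp,telnet,SF,1");
        System.out.println("records with unknown labels: " + unknownLabelLines(records, known));
    }
}
```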

Looking forward to your reply

Thanks
Ranjitha.

-----Original Message-----
From: deneche abdelhakim [mailto:adeneche@gmail.com] 
Sent: 17 January 2013 18:20
To: user@mahout.apache.org
Subject: Re: Issue with Partial Implementation Problem

Hi Ranjitha,

just put any numerical value in the label attribute. You should be able to
classify the data, but you won't be able to compute the confusion matrix or
the accuracy.


On Thu, Jan 17, 2013 at 12:15 PM, Ranjitha Chandrashekar <
Ranjitha.Ch@hcl.com> wrote:

> Hi
>
> I am using Partial Implementation for Random Forest classification.
>
> I have a training dataset with labels class0, class 1, class 2.  The
> decision forest is built on this training dataset.  The classification for
> the test dataset is computed using the same data descriptor generated for
> the training dataset.  I am able to generate confusion matrix, accuracy
> details with the test data set with class variable.
>
> However I also need to make a classification for a scenario, where test
> data may not have the class variable or class values are not known.  For
> ex, assume test data is about future data points, for which class values
> will have to be computed only in the future.
>
>
> *         How is it possible to classify the test data set, where the
> class label is not defined or not known. I have tried using default labels
> like "unknown", "NO_LABEL". It doesnt seem to work.
>
>
> *         How to set the class label as "unknown" in the testing dataset.
>
> Looking forward to your reply,
>
> Thanks
> Ranjitha.

Re: Issue with Partial Implementation Problem

Posted by deneche abdelhakim <ad...@gmail.com>.

Hi Ranjitha,

just put any numerical value in the label attribute. You should be able to
classify the data, but you won't be able to compute the confusion matrix or
the accuracy.


On Thu, Jan 17, 2013 at 12:15 PM, Ranjitha Chandrashekar <
Ranjitha.Ch@hcl.com> wrote:

> Hi
>
> I am using Partial Implementation for Random Forest classification.
>
> I have a training dataset with labels class0, class 1, class 2.  The
> decision forest is built on this training dataset.  The classification for
> the test dataset is computed using the same data descriptor generated for
> the training dataset.  I am able to generate confusion matrix, accuracy
> details with the test data set with class variable.
>
> However I also need to make a classification for a scenario, where test
> data may not have the class variable or class values are not known.  For
> ex, assume test data is about future data points, for which class values
> will have to be computed only in the future.
>
>
> *         How is it possible to classify the test data set, where the
> class label is not defined or not known. I have tried using default labels
> like "unknown", "NO_LABEL". It doesnt seem to work.
>
>
> *         How to set the class label as "unknown" in the testing dataset.
>
> Looking forward to your reply,
>
> Thanks
> Ranjitha.