Posted to dev@mahout.apache.org by Xiaobo Gu <gu...@gmail.com> on 2011/08/06 04:54:15 UTC

trainAdaptiveLogistic KDD data set test result

Hi,
I have done a very simple test against the KDD datasets, which can be
downloaded from http://nsl.cs.unb.ca/NSL-KDD/.
KDDTrain+.TXT and KDDTest+.TXT are used as the training and test
datasets respectively, and I added the following header to each of them
so that they can be read as CSV files:
duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
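
The header step described above can be sketched in Python as follows. This is only an illustration, not what the poster actually ran: the function name `prepend_header`, the temporary file names, and the three-column stand-in data are all assumptions made for the example. It also reports how many fields the first data row has, since a mismatch between the header and the data is worth ruling out.

```python
import csv
import os
import tempfile


def prepend_header(src, dst, header):
    """Copy src to dst, writing `header` as the first line.

    Returns (number of header names, number of fields in the first data row)
    so the caller can spot a column-count mismatch.
    """
    header = header.strip()
    first_row_fields = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        fout.write(header + "\n")
        for i, line in enumerate(fin):
            if i == 0:
                # Parse only the first data row to count its fields.
                first_row_fields = len(next(csv.reader([line])))
            fout.write(line)
    return len(header.split(",")), first_row_fields


# Tiny stand-in file (hypothetical data); the real inputs would be
# KDDTrain+.TXT and KDDTest+.TXT with the full header from the post.
tmp = tempfile.mkdtemp()
raw = os.path.join(tmp, "raw.txt")
with open(raw, "w", newline="") as f:
    f.write("0,tcp,normal\n1,udp,anomaly\n")

out = os.path.join(tmp, "train.csv")
names, fields = prepend_header(raw, out, "duration,protocol_type,class")
print("header names:", names, "fields in first row:", fields)
```

If the two counts differ on the real files, the header and the data are not aligned, and that would be worth fixing before training.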

The training and validation commands are as follows:
mahout trainAdaptiveLogistic --input d:\\train.csv --output d:\\model1
--target class --categories 2 --predictors duration protocol_type
service flag src_bytes dst_bytes land wrong_fragment urgent hot
num_failed_logins logged_in num_compromised root_shell su_attempted
num_root num_file_creations num_shells num_access_files
num_outbound_cmds is_host_login is_guest_login count srv_count
serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate
diff_srv_rate srv_diff_host_rate dst_host_count dst_host_srv_count
dst_host_same_srv_rate dst_host_diff_srv_rate
dst_host_same_src_port_rate dst_host_srv_diff_host_rate
dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate
dst_host_srv_rerror_rate --types numeric word word word numeric
numeric word numeric numeric numeric numeric word numeric numeric
numeric numeric numeric numeric numeric numeric word word numeric
numeric numeric numeric numeric numeric numeric numeric numeric
numeric numeric numeric numeric numeric numeric numeric numeric
numeric numeric numeric numeric numeric numeric numeric numeric
numeric numeric numeric numeric  --threads 4 --passes 1 --showperf
--features 500 --skipperfnum 399
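
With lists this long, it may be worth confirming that the --predictors and --types lists line up one to one. The sketch below is a generic check, not part of Mahout: the helper name `pair_predictors_with_types` and the shortened lists are assumptions for illustration, and how trainAdaptiveLogistic itself treats a length mismatch is not stated in this thread.

```python
def pair_predictors_with_types(predictors, types):
    """Pair names with types positionally and report any leftovers.

    Returns (pairs, extra_types, untyped_predictors).
    """
    n = min(len(predictors), len(types))
    pairs = list(zip(predictors[:n], types[:n]))
    return pairs, list(types[n:]), list(predictors[n:])


# Shortened lists for illustration; paste the full lists from the command
# to check the real invocation.
predictors = ["duration", "protocol_type", "service", "flag"]
types = ["numeric", "word", "word", "word", "numeric"]  # one extra on purpose

pairs, extra_types, untyped = pair_predictors_with_types(predictors, types)
for name, kind in pairs:
    print(f"{name:15s} {kind}")
print("extra types:", extra_types)
print("untyped predictors:", untyped)
```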

mahout validateAdaptiveLogistic --input d:\\test.csv --model
d:\\model1 --auc --confusion --scores

And the output of validateAdaptiveLogistic on my system is as follows:


Log-likelihood:Min=-100.00, Max=0.00, Mean=-27.13, Median=0.00

AUC = 0.48

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	<--Classified as
9711 	0    	0    	 |  9711  	a     = normal
0    	12833	0    	 |  12833 	b     = anomaly
0    	0    	0    	 |  0     	c     = unknown
Default Category: unknown: 2



Entropy Matrix: [[NaN, NaN], [-0.0, -0.0]]

There are a few questions about the output:
#1, From the Confusion Matrix, it seems all the records are
classified correctly, but the AUC is just 0.48; it should be 1.
#2, What does the number after unknown mean? Is it the internal code for
unknown? Since the confusion matrix is created with a default category
named unknown, will unknown always be shown in the result, even when no
records are unknown, as in this example?
#3, The Entropy Matrix seems not to be working either.
#4, Since the result is 100% correct, is there something wrong?
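
The reasoning behind question #1 can be sketched with the rank-based definition of AUC: the probability that a randomly chosen positive record scores higher than a randomly chosen negative one. This is a generic illustration with made-up scores, not Mahout's AUC code; the function name `auc` is an assumption for the example.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Perfect separation: every positive scores above every negative.
print(auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # 1.0

# One positive ranked below one negative: 3 of 4 pairs are ordered correctly.
print(auc([0.9, 0.4], [0.6, 0.1]))  # 0.75
```

If a two-class model classifies every record correctly from its scores, all positives must score above all negatives, so an AUC computed from those same scores would be 1. A value near 0.5 next to a perfect confusion matrix therefore suggests the two figures are not derived from the same quantities.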

Regards,

Xiaobo Gu

Re: trainAdaptiveLogistic KDD data set test result

Posted by Ted Dunning <te...@gmail.com>.
These are odd numbers.  The first problem is that you have a three-way
target and still have AUC.  That shouldn't work.  Likewise, you seem to have
no data for the third case.  This results in NaN when logs are taken.

Try re-running this without the third target variable.
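
The NaN remark can be illustrated with a small sketch. It is not Mahout's entropy code; the helper names are assumptions, and `ieee_log` only mimics a math library that returns negative infinity for log(0) instead of raising (as Python's `math.log` does).

```python
import math


def ieee_log(x):
    """Natural log that returns -inf at 0 rather than raising ValueError."""
    return math.log(x) if x != 0 else float("-inf")


def entropy_term(count, total):
    """Naive p * log(p); yields NaN when count is 0 because 0 * -inf is NaN."""
    p = count / total
    return p * ieee_log(p)


def safe_entropy_term(count, total):
    """Same term, using the usual convention that 0 * log(0) is 0."""
    if count == 0:
        return 0.0
    p = count / total
    return p * math.log(p)


print(entropy_term(0, 10))       # nan: an empty category poisons the result
print(safe_entropy_term(0, 10))  # 0.0
print(entropy_term(5, 10))       # 0.5 * log(0.5), roughly -0.3466
```

A category with no records contributes 0 * log(0), which evaluates to NaN unless the zero case is handled explicitly; removing the empty category avoids the problem.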

On Fri, Aug 5, 2011 at 7:54 PM, Xiaobo Gu <gu...@gmail.com> wrote:

> And the output of validateAdaptiveLogistic on my system is as follows:
>
>
> Log-likelihood:Min=-100.00, Max=0.00, Mean=-27.13, Median=0.00
>
> AUC = 0.48
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a       b       c       <--Classified as
> 9711    0       0        |  9711        a     = normal
> 0       12833   0        |  12833       b     = anomaly
> 0       0       0        |  0           c     = unknown
> Default Category: unknown: 2
>
>
>
> Entropy Matrix: [[NaN, NaN], [-0.0, -0.0]]
>
> There are a few questions about the output:
> #1, From the Confusion Matrix, it seems all the records are
> classified correctly, but the AUC is just 0.48; it should be 1.
> #2, What does the number after unknown mean? Is it the internal code for
> unknown? Since the confusion matrix is created with a default category
> named unknown, will unknown always be shown in the result, even when no
> records are unknown, as in this example?
> #3, The Entropy Matrix seems not to be working either.
> #4, Since the result is 100% correct, is there something wrong?
>

Re: trainAdaptiveLogistic KDD data set test result

Posted by Xiaobo Gu <gu...@gmail.com>.
I use the KDDTrain+.TXT and KDDTest+.TXT files for training and
validation, respectively.

On Sat, Aug 6, 2011 at 11:21 AM, Ted Dunning <te...@gmail.com> wrote:
> I would be very suspicious.  KDD problems are not normally so easy.
>
> On Fri, Aug 5, 2011 at 7:54 PM, Xiaobo Gu <gu...@gmail.com> wrote:
>
>> #4, Since the result is 100% correct, is there something wrong?
>>
>

Re: trainAdaptiveLogistic KDD data set test result

Posted by Ted Dunning <te...@gmail.com>.
I would be very suspicious.  KDD problems are not normally so easy.

On Fri, Aug 5, 2011 at 7:54 PM, Xiaobo Gu <gu...@gmail.com> wrote:

> #4, Since the result is 100% correct, is there something wrong?
>

Re: trainAdaptiveLogistic KDD data set test result

Posted by Xiaobo Gu <gu...@gmail.com>.
But Weka 3.7.4 and SAS EM give reasonable results; you can try them and see.

On Sat, Aug 6, 2011 at 12:28 PM, Xiaobo Gu <gu...@gmail.com> wrote:
> On Sat, Aug 6, 2011 at 12:18 PM, Xiaobo Gu <gu...@gmail.com> wrote:
>> ValidateAdaptiveLogistic now has a default option that lets the user
>> specify the default value for the confusion matrix. The output is now:
>>
>> Log-likelihood:Min=-100.00, Max=0.00, Mean=-27.10, Median=0.00
>>
>> AUC = 0.48
>>
>> =======================================================
>> Confusion Matrix
>> -------------------------------------------------------
>> a       c       <--Classified as
>> 9711    0        |  9711        a     = normal
>> 0       12833    |  12833       c     = anomaly
>> Default Category: anomaly: 2
>>
>>
>>
>> Entropy Matrix: [[NaN, NaN], [-0.0, -0.0]]
>>
>> It seems AUC and Entropy still don't work. Also, the labels the
>> confusion matrix uses should be sequential, that is, it should use
>> a, b, c, d, e, ..., but here b is not used.
>>
>>
>> On Sat, Aug 6, 2011 at 11:21 AM, Ted Dunning <te...@gmail.com> wrote:
>>> Try setting the default category to 'a' when the confusion matrix is
>>> constructed.
>>>
>>> On Fri, Aug 5, 2011 at 7:54 PM, Xiaobo Gu <gu...@gmail.com> wrote:
>>>
>>>> #2, What does the number after unknown mean? Is it the internal code for
>>>> unknown? Since the confusion matrix is created with a default category
>>>> named unknown, will unknown always be shown in the result, even when no
>>>> records are unknown, as in this example?
>>>>
>>>
>>
>

Re: trainAdaptiveLogistic KDD data set test result

Posted by Xiaobo Gu <gu...@gmail.com>.
ValidateAdaptiveLogistic now has a default option that lets the user
specify the default value for the confusion matrix. The output is now:

Log-likelihood:Min=-100.00, Max=0.00, Mean=-27.10, Median=0.00

AUC = 0.48

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	c    	<--Classified as
9711 	0    	 |  9711  	a     = normal
0    	12833	 |  12833 	c     = anomaly
Default Category: anomaly: 2



Entropy Matrix: [[NaN, NaN], [-0.0, -0.0]]

It seems AUC and Entropy still don't work. Also, the labels the
confusion matrix uses should be sequential, that is, it should use
a, b, c, d, e, ..., but here b is not used.
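
One way the skipped "b" and the trailing ": 2" could arise is sketched below. This is a hypothetical reconstruction that happens to reproduce both outputs in this thread; it has not been checked against Mahout's ConfusionMatrix source, and the names `build_label_index` and `short_name` are invented for the example. The assumption is that each label is stored in a map from label to index, and that the default category is always given the index just past the listed labels.

```python
def build_label_index(labels, default_label):
    """Map each label to a row/column index (hypothetical scheme)."""
    index = {}
    for label in labels:
        index[label] = len(index)
    # Assumption: the default label is always placed after the listed labels,
    # overwriting any index it already had.
    index[default_label] = len(labels)
    return index


def short_name(i):
    """Letter shown in the matrix header for index i: 0 -> a, 1 -> b, ..."""
    return chr(ord("a") + i)


for default in ("unknown", "anomaly"):
    index = build_label_index(["normal", "anomaly"], default)
    shown = {label: short_name(i) for label, i in index.items()}
    print(default, index, shown)
```

Under this assumption, a default of "unknown" gives labels a, b, c with "unknown: 2", while a default of "anomaly" moves anomaly from index 1 to index 2, so it is printed as c and b never appears. The number after the default category would then simply be its internal index.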


On Sat, Aug 6, 2011 at 11:21 AM, Ted Dunning <te...@gmail.com> wrote:
> Try setting the default category to 'a' when the confusion matrix is
> constructed.
>
> On Fri, Aug 5, 2011 at 7:54 PM, Xiaobo Gu <gu...@gmail.com> wrote:
>
>> #2, What does the number after unknown mean? Is it the internal code for
>> unknown? Since the confusion matrix is created with a default category
>> named unknown, will unknown always be shown in the result, even when no
>> records are unknown, as in this example?
>>
>

Re: trainAdaptiveLogistic KDD data set test result

Posted by Ted Dunning <te...@gmail.com>.
Try setting the default category to 'a' when the confusion matrix is
constructed.

On Fri, Aug 5, 2011 at 7:54 PM, Xiaobo Gu <gu...@gmail.com> wrote:

> #2, What does the number after unknown mean? Is it the internal code for
> unknown? Since the confusion matrix is created with a default category
> named unknown, will unknown always be shown in the result, even when no
> records are unknown, as in this example?
>
