Posted to dev@samoa.apache.org by Gianmarco De Francisci Morales <gd...@apache.org> on 2015/09/07 17:23:40 UTC

Fwd: HoeffdingTree and VerticalHoeffdingTree Classifiers run on KDD Cup 99 Data Set

Forwarding to the @dev list.
--
Gianmarco

---------- Forwarded message ----------
From: Ercan Öztürk <e....@gmail.com>
Date: 7 September 2015 at 16:57
Subject: HoeffdingTree and VerticalHoeffdingTree Classifiers run on KDD Cup
99 Data Set
To: gdfm@apache.org


Hi Mr. Morales and Mr. Bifet,

We are a group of undergraduate students from TOBB University. As a data
mining class project, we decided to run the HoeffdingTree classifier in MOA
and the VerticalHoeffdingTree classifier in SAMOA on the KDD Cup 99 data set
(we couldn't attach the data set to this mail due to the size limitations of
the Apache mail server) and present the results in our project report.

We were able to run the HoeffdingTree algorithm on the KDD Cup 99 data set
(on both kddcup_full.arff and kddcup_10_percent.arff). The
VerticalHoeffdingTree classifier also works fine on kddcup_10_percent.arff.
However, when we try to run the VerticalHoeffdingTree classifier on
kddcup_full.arff, we get the following error:

The command we use to run SAMOA Local:

bin/samoa local target/SAMOA-Local-0.3.0-SNAPSHOT.jar
"PrequentialEvaluation -i -1 -f 41920 -l
(com.yahoo.labs.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p
4) -s (com.yahoo.labs.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)"

The console output of samoa:

bin/samoa
Deploying to LOCAL
Command line string =  PrequentialEvaluation -i -1 -f 41920 -l (com.yahoo.labs.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s (com.yahoo.labs.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)
2015-09-01 22:22:16,160 [main] INFO  com.yahoo.labs.samoa.LocalDoTask (LocalDoTask.java:79) - Successfully instantiating com.yahoo.labs.samoa.tasks.PrequentialEvaluation
2015-09-01 22:22:17,741 [main] INFO  com.yahoo.labs.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:86) - 1 seconds for 41920 instances
2015-09-01 22:22:17,760 [main] INFO  com.yahoo.labs.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:172) - evaluation instances = 41,920
classified instances = 41,920
classifications correct (percent) = 99.988
Kappa Statistic (percent) = -0.002
Kappa Temporal Statistic (percent) = 28.571
Exception in thread "main" java.lang.NullPointerException
    at com.yahoo.labs.samoa.learners.classifiers.trees.ModelAggregatorProcessor.process(ModelAggregatorProcessor.java:145)
    at com.yahoo.labs.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
    at com.yahoo.labs.samoa.topology.impl.SimpleStream.put(SimpleStream.java:71)
    at com.yahoo.labs.samoa.topology.impl.SimpleStream.put(SimpleStream.java:60)
    at com.yahoo.labs.samoa.learners.classifiers.trees.FilterProcessor.process(FilterProcessor.java:95)
    at com.yahoo.labs.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
    at com.yahoo.labs.samoa.topology.impl.SimpleStream.put(SimpleStream.java:71)
    at com.yahoo.labs.samoa.topology.impl.SimpleStream.put(SimpleStream.java:60)
    at com.yahoo.labs.samoa.topology.LocalEntranceProcessingItem.injectNextEvent(LocalEntranceProcessingItem.java:46)
    at com.yahoo.labs.samoa.topology.LocalEntranceProcessingItem.startSendingEvents(LocalEntranceProcessingItem.java:66)
    at com.yahoo.labs.samoa.topology.impl.SimpleTopology.run(SimpleTopology.java:42)
    at com.yahoo.labs.samoa.topology.impl.SimpleEngine.submitTopology(SimpleEngine.java:33)
    at com.yahoo.labs.samoa.LocalDoTask.main(LocalDoTask.java:87)


We tracked the problem down to the first instance that causes it; the
instance is on line 76426 of kddcup_full.arff. The instance is as follows:

1,tcp,smtp,SF,2252,331,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,7,0,0,0,0,1,0,1,5,216,1,0,0.2,0.01,0,0,0,0,normal

We haven’t noticed any differences between the problematic instance and the
other instances. Could you point us to the root of the problem and help us
overcome it?

As a workaround we’ve made the following addition to
ModelAggregatorProcessor.java

if (leafNode == null)

        return false;

after the line

ActiveLearningNode leafNode = (ActiveLearningNode) foundNode.getNode();

Now the VerticalHoeffdingTree classifier also works fine on
kddcup_full.arff. Is this an acceptable solution to the problem? What do you
think?
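
For illustration, here is a minimal, self-contained sketch of the kind of null guard described above. GuardLeaf, GuardFoundNode and NullGuardSketch are hypothetical stand-ins made up for this example, not the SAMOA classes. The sketch also skips the entry with continue, whereas the workaround above returns false from the enclosing method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a leaf node; not SAMOA's ActiveLearningNode.
final class GuardLeaf {
    final String name;

    GuardLeaf(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for FoundNode: wraps a leaf reference that may be null.
final class GuardFoundNode {
    private final GuardLeaf node;

    GuardFoundNode(GuardLeaf node) {
        this.node = node;
    }

    GuardLeaf getNode() {
        return node;
    }
}

final class NullGuardSketch {
    // Collects the names of all leaves that are present, skipping null ones.
    static List<String> visitLeaves(List<GuardFoundNode> foundNodes) {
        List<String> visited = new ArrayList<>();
        for (GuardFoundNode found : foundNodes) {
            GuardLeaf leafNode = found.getNode();
            if (leafNode == null) {
                // Guard: without it, leafNode.name below would throw an NPE.
                continue;
            }
            visited.add(leafNode.name);
        }
        return visited;
    }

    public static void main(String[] args) {
        List<GuardFoundNode> nodes = Arrays.asList(
                new GuardFoundNode(new GuardLeaf("a")),
                new GuardFoundNode(null),
                new GuardFoundNode(new GuardLeaf("b")));
        System.out.println(visitLeaves(nodes));
    }
}
```

A guard like this only hides the symptom: the instance that reaches the missing leaf is silently dropped from training.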


We were also wondering how we could inspect the model contents, e.g., by
visiting the nodes and reading each node's content.

Thanks for your help,


Respectfully,

Ercan Ozturk, Davut Deniz Yavuz, Gozde Boztepe, Sezin Gurkan

Re: HoeffdingTree and VerticalHoeffdingTree Classifiers run on KDD Cup 99 Data Set

Posted by Ercan Öztürk <e....@gmail.com>.
Hi,

No problem, please let us know if you have any problems running samoa on
the dataset.

Thank you for the tip. We now call the tree-printing function when
isLastEvent() returns true, and it appears to work.
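
As an illustration of such a recursive tree-printing function, here is a small sketch. DumpNode, DumpLeaf, DumpSplit and TreeDumpSketch are hypothetical names invented for this example; the actual describeSubtree() signature in SAMOA's Node.java may differ.

```java
// Hypothetical node hierarchy for illustration; not the SAMOA Node classes.
abstract class DumpNode {
    // Appends a textual description of this subtree to out.
    abstract void describe(StringBuilder out, int depth);

    static void indent(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }
}

final class DumpLeaf extends DumpNode {
    private final String label;

    DumpLeaf(String label) {
        this.label = label;
    }

    @Override
    void describe(StringBuilder out, int depth) {
        indent(out, depth);
        out.append("leaf: ").append(label).append('\n');
    }
}

final class DumpSplit extends DumpNode {
    private final String attribute;
    private final DumpNode[] children;

    DumpSplit(String attribute, DumpNode... children) {
        this.attribute = attribute;
        this.children = children;
    }

    @Override
    void describe(StringBuilder out, int depth) {
        indent(out, depth);
        out.append("split on ").append(attribute).append('\n');
        for (int i = 0; i < children.length; i++) {
            indent(out, depth + 1);
            if (children[i] == null) {
                // An unfilled branch is reported instead of being dereferenced.
                out.append("branch ").append(i).append(": <empty>\n");
            } else {
                out.append("branch ").append(i).append(":\n");
                children[i].describe(out, depth + 2);
            }
        }
    }
}

final class TreeDumpSketch {
    static String dump(DumpNode root) {
        StringBuilder out = new StringBuilder();
        root.describe(out, 0);
        return out.toString();
    }

    public static void main(String[] args) {
        DumpNode root = new DumpSplit("protocol", new DumpLeaf("normal"), null);
        System.out.print(dump(root));
    }
}
```

Printing empty branches explicitly makes it easy to spot a split node whose children array still has a null slot.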

About the NPE,

In the filterInstanceToLeaf method
<https://github.com/apache/incubator-samoa/blob/9b178f63152e5b4c262e0f3ed28e77667832fc98/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/SplitNode.java#L49>
of SplitNode, even if the childIndex returned from the instanceChildIndex
function is greater than or equal to zero, the entry at that index in the
children array is null, so a null node is returned.

The function getChild seemed pretty straightforward to me, so I looked
into the other function, instanceChildIndex. Since the instanceChildIndex
method uses the splitTest's branchForInstance function, I thought that
might be the issue. I checked the runtime class of the splitTest when the
error occurred and it was NominalAttributeMultiwayTest
<https://github.com/apache/incubator-samoa/blob/9b178f63152e5b4c262e0f3ed28e77667832fc98/samoa-api/src/main/java/org/apache/samoa/moa/classifiers/core/conditionaltests/NominalAttributeMultiwayTest.java>.
In the branchForInstance function of the class, the value is returned
<https://github.com/apache/incubator-samoa/blob/9b178f63152e5b4c262e0f3ed28e77667832fc98/samoa-api/src/main/java/org/apache/samoa/moa/classifiers/core/conditionaltests/NominalAttributeMultiwayTest.java#L46>
instead of the index. Do you think this can be the problem? I guess the code
was migrated from MOA and there is logic behind it. However, I couldn't find
any NominalAttributeMultiwayTest objects created by using the value instead
of the attribute index. That may result in a null node.

What do you think of the explanation above as the source of the NPE? Let me
know if I am missing something. If you point me to another probable cause,
I can look into that, too.
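
To make the scenario concrete, here is a simplified sketch of how a valid branch index can still lead to a null node. BranchNode, BranchLeaf, BranchSplit and BranchFound are hypothetical stand-ins invented for this example (instances are reduced to an int array of nominal value indices); they only mimic the behaviour described above and are not the SAMOA implementation.

```java
// Hypothetical result holder, loosely modelled on the FoundNode described above.
final class BranchFound {
    final BranchNode node;      // null when the selected branch has no child
    final BranchSplit parent;
    final int parentBranch;

    BranchFound(BranchNode node, BranchSplit parent, int parentBranch) {
        this.node = node;
        this.parent = parent;
        this.parentBranch = parentBranch;
    }
}

abstract class BranchNode {
    // Default behaviour for leaves: the instance stops here.
    BranchFound filterInstanceToLeaf(int[] instance, BranchSplit parent, int parentBranch) {
        return new BranchFound(this, parent, parentBranch);
    }
}

final class BranchLeaf extends BranchNode {
}

final class BranchSplit extends BranchNode {
    private final int attributeIndex;
    private final BranchNode[] children;

    BranchSplit(int attributeIndex, int numBranches) {
        this.attributeIndex = attributeIndex;
        this.children = new BranchNode[numBranches]; // slots start out null
    }

    void setChild(int branch, BranchNode child) {
        children[branch] = child;
    }

    // Assumption for this sketch: the nominal value is used directly as the branch.
    int instanceChildIndex(int[] instance) {
        return instance[attributeIndex];
    }

    @Override
    BranchFound filterInstanceToLeaf(int[] instance, BranchSplit parent, int parentBranch) {
        int childIndex = instanceChildIndex(instance);
        if (childIndex >= 0 && childIndex < children.length) {
            BranchNode child = children[childIndex];
            if (child != null) {
                return child.filterInstanceToLeaf(instance, this, childIndex);
            }
            // Valid index, but nothing was ever attached to that branch.
            return new BranchFound(null, this, childIndex);
        }
        return new BranchFound(this, parent, parentBranch);
    }
}

final class NullBranchSketch {
    public static void main(String[] args) {
        BranchSplit split = new BranchSplit(0, 3);
        split.setChild(0, new BranchLeaf());
        split.setChild(1, new BranchLeaf());
        BranchFound found = split.filterInstanceToLeaf(new int[] {2}, null, -1);
        System.out.println("node is null: " + (found.node == null));
    }
}
```

In this sketch the null comes from a branch that never received a child, not from an out-of-range index, which matches the symptom described above but does not prove it is the actual cause.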

Thanks.

Respectfully,
Ercan Ozturk



2015-09-16 15:45 GMT+03:00 Gianmarco De Francisci Morales <gd...@apache.org>:

> Hi,
>
> Thanks for sharing the dataset.
>
> Yes, the tree consists only of internal nodes (split) and leaves
> (learning).
> Where to call the function is a good question.
> Given that we are dealing with unbounded streams, in theory there is no
> "end of the stream", so there is no final form of the tree.
> Of course this is not true when you are running offline on a fixed dataset.
> The class ContentEvent has a method public boolean isLastEvent() which
> should make it possible to catch the last event in a data stream from a file.
> This is implemented in InstanceContentEvent, and used
> in PrequentialSourceProcessor (though its use is not very consistent
> throughout the codebase).
> So one way to do it could be to catch this event and call the dumping
> function if isLastEvent is true.
>
> Thanks for debugging the NPE.
> If I am not mistaken, there should not be a case where the leaf is null,
> i.e., one should always be able to sort an instance to a leaf.
> Even if the instance is sparse, the missing values of the sparse instance
> are defined.
> Do you know why that instance fails to be sorted to its correct leaf?
>
> Cheers,
>
>
> --
> Gianmarco
>
> On 15 September 2015 at 15:50, Davut Deniz Yavuz <
> davut.deniz.yavuz@gmail.com> wrote:
>
>> Hi,
>>
>> We have implemented a recursive method to dump the tree by implementing
>> the describeSubTree methods in  ActiveLearningNode and SplitNode. (We
>> checked the tree and it only consists of these types of nodes.) We print
>> the tree by calling the describeSubTree function of the root of the tree.
>>
>> Unfortunately, we couldn’t decide where to call the
>> treeRoot.describeSubTree function. Currently, we are calling it at the end
>> of the attemptToSplit
>> <https://github.com/apache/incubator-samoa/blob/9b178f63152e5b4c262e0f3ed28e77667832fc98/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L463> function.
>> Therefore, the code prints a couple of trees and we are not sure if it
>> prints the final form of the tree. Which place would be more appropriate to
>> call this function to print just the final form of the tree?
>>
>> --
>>
>> We debugged the NullPointerException and we’ve found out that the source
>> of the problem is in the filterInstanceToLeaf method in SplitNode class.
>> The method returns new FoundNode(null, this, childIndex)
>> <https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/SplitNode.java#L56>
>> when the child is null. This method is called from ModelAggregatorProcessor
>> by trainOnInstanceImpl(Instance inst)
>> <https://github.com/apache/incubator-samoa/blob/9b178f63152e5b4c262e0f3ed28e77667832fc98/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L425>.
>> Thus, a FoundNode whose getNode function returns null is produced on
>> line 394:
>>
>> FoundNode foundNode = this.treeRoot.filterInstanceToLeaf(inst, null, -1)
>> <https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L394>.
>>
>>
>> This node, then, is added to foundNodeSet by the line
>> this.foundNodeSet.add(foundNode)
>> <https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L452>.
>> Because of this, while looping
>> <https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L140> over
>> the elements in the foundNodeSet, the getNode() call that is assigned to
>> leafNode returns null, which causes a NullPointerException.
>>
>> To solve the problem we checked how MOA handles the null node; however,
>> to our knowledge, there is no structure like foundNodeSet in MOA.
>>
>> --
>>
>> The KDD Cup 99 Dataset can be downloaded from here
>> <https://drive.google.com/file/d/0B_XfRoyTW4lJWFF5TWttR0VLTnhCbERrcEtjY2thWGtLUlZv/view?usp=sharing>
>> .
>>
>> Respectfully,
>> Davut Deniz Yavuz, Ercan Ozturk
>>
>> 2015-09-11 9:57 GMT+03:00 Gianmarco De Francisci Morales <gdfm@apache.org
>> >:
>>
>>> Sure, the ticket is SAMOA-44
>>> <https://issues.apache.org/jira/browse/SAMOA-44>.
>>>
>>> Arinto had started the work on model dumping, I don't know what's the
>>> status there.
>>> But it should be straightforward to implement a recursive method.
>>>
>>> If you could post the dataset somewhere where it is possible to download
>>> it, it would be great.
>>> If you want to take a stab at debugging what's going on and provide a
>>> patch, it would be even better.
>>>
>>> Cheers,
>>>
>>> --
>>> Gianmarco
>>>
>>> On 10 September 2015 at 08:49, Ercan Öztürk <e....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thank you very much for your quick response.
>>>>
>>>> We were using an older version of SAMOA. I've updated the code now (The
>>>> last commit is currently "SAMOA-29: Excluding the samoa-storm.properties at
>>>> compile time and including at test") and after building the code with "mvn
>>>> package" the new command we use to run SAMOA is
>>>>
>>>> local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar
>>>> "PrequentialEvaluation -i -1 -f 41920 -l
>>>> (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s
>>>> (org.apache.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)"
>>>>
>>>> The console output when the command is run:
>>>>
>>>> bin/samoa
>>>> Deploying to LOCAL
>>>> Command line string =  PrequentialEvaluation -i -1 -f 41920 -l
>>>> (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s
>>>> (org.apache.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)
>>>> 2015-09-09 15:56:30,036 [main] INFO  org.apache.samoa.LocalDoTask (LocalDoTask.java:80) - Successfully instantiating org.apache.samoa.tasks.PrequentialEvaluation
>>>> 2015-09-09 15:56:31,221 [main] INFO  org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:83) - 1 seconds for 41920 instances
>>>> 2015-09-09 15:56:31,227 [main] INFO  org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:169) - evaluation instances = 41,920
>>>> classified instances = 41,920
>>>> classifications correct (percent) = 99.988
>>>> Kappa Statistic (percent) = -0.002
>>>> Kappa Temporal Statistic (percent) = 28.571
>>>> Exception in thread "main" java.lang.NullPointerException
>>>>     at org.apache.samoa.learners.classifiers.trees.ModelAggregatorProcessor.process(ModelAggregatorProcessor.java:142)
>>>>     at org.apache.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
>>>>     at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:72)
>>>>     at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:61)
>>>>     at org.apache.samoa.learners.classifiers.trees.FilterProcessor.process(FilterProcessor.java:93)
>>>>     at org.apache.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
>>>>     at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:72)
>>>>     at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:61)
>>>>     at org.apache.samoa.topology.LocalEntranceProcessingItem.injectNextEvent(LocalEntranceProcessingItem.java:45)
>>>>     at org.apache.samoa.topology.LocalEntranceProcessingItem.startSendingEvents(LocalEntranceProcessingItem.java:63)
>>>>     at org.apache.samoa.topology.impl.SimpleTopology.run(SimpleTopology.java:44)
>>>>     at org.apache.samoa.topology.impl.SimpleEngine.submitTopology(SimpleEngine.java:33)
>>>>     at org.apache.samoa.LocalDoTask.main(LocalDoTask.java:88)
>>>>
>>>>
>>>> We would very much appreciate it if you could send us the link for the
>>>> ticket so we can follow the updates on the issue.
>>>>
>>>> Yes, we would like to dump the model so that we can see the rules of
>>>> the model and have a better understanding of it.
>>>>
>>>> The method body of describeSubtree() in Node.java is currently empty.
>>>> Is there any work done on it that we can use as a starting point?
>>>>
>>>> If you need the data set to investigate the issue, I can send it via
>>>> any suitable channel, please let me know.
>>>>
>>>> Respectfully,
>>>> Ercan Ozturk
>>>>
>>>> 2015-09-09 15:11 GMT+03:00 Gianmarco De Francisci Morales <
>>>> gdfm@apache.org>:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for reporting the bug.
>>>>> I'm not sure what is causing the issue.
>>>>> Are you using the master version of SAMOA?
>>>>> My line 145 of ModelAggregator is:
>>>>>               this.sendToAttributeStream(abce[i]);
>>>>>
>>>>> From what you say it seems that the problem is a bit above, and
>>>>> leafNode is null.
>>>>> However, by construction there should always be a leaf node.
>>>>>
>>>>> As a workaround your solution is fine, but I guess there is some other
>>>>> underlying problem with the code, which might cause some loss in accuracy.
>>>>> We should investigate this issue further, I'll open a ticket.
>>>>>
>>>>> Regarding fetching the content of the model, we had some prototype
>>>>> model dumper code (Arinto had started it), but I guess it's not working
>>>>> anymore. See the describeSubtree() method in Node.java.
>>>>> So unfortunately you need to do it yourself. However, the good thing
>>>>> is that the tree model is in a single place in ModelAggregator, so it
>>>>> should be relatively easy to walk the tree, starting from the root node.
>>>>> Do you want to dump the model to a text representation for human
>>>>> inspection?
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> --
>>>>> Gianmarco
>>>>>

Re: HoeffdingTree and VerticalHoeffdingTree Classifiers run on KDD Cup 99 Data Set

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Hi,

Thanks for sharing the dataset.

Yes, the tree consists only of internal nodes (split) and leaves (learning).
Where to call the function is a good question.
Given that we are dealing with unbounded streams, in theory there is no
"end of the stream", so there is no final form of the tree.
Of course this is not true when you are running offline on a fixed dataset.
The class ContentEvent has a method public boolean isLastEvent() which
should make it possible to catch the last event in a data stream from a file.
This is implemented in InstanceContentEvent, and used
in PrequentialSourceProcessor (though its use is not very consistent
throughout the codebase).
So one way to do it could be to catch this event and call the dumping
function if isLastEvent is true.
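
A rough sketch of that idea follows. Only the method name isLastEvent() is taken from the discussion above; the LastAwareEvent interface and the DumpOnLastEventSketch class are hypothetical and stand in for SAMOA's ContentEvent and a processor, respectively.

```java
// Hypothetical event type; only the isLastEvent() name mirrors the thread.
interface LastAwareEvent {
    boolean isLastEvent();
}

// Hypothetical processor that runs a dump action once the last event arrives.
final class DumpOnLastEventSketch {
    private final Runnable dumpAction;
    private int processed = 0;

    DumpOnLastEventSketch(Runnable dumpAction) {
        this.dumpAction = dumpAction;
    }

    boolean process(LastAwareEvent event) {
        // Regular per-event work (training, prediction) would happen here.
        processed++;
        if (event.isLastEvent()) {
            // The stream from the file is exhausted, so the tree is in its final form.
            dumpAction.run();
        }
        return true;
    }

    int processedCount() {
        return processed;
    }

    public static void main(String[] args) {
        DumpOnLastEventSketch processor =
                new DumpOnLastEventSketch(() -> System.out.println("dumping tree"));
        processor.process(() -> false);
        processor.process(() -> true);
    }
}
```

With a truly unbounded stream no event reports itself as the last one, so the dump action would simply never run.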

Thanks for debugging the NPE.
If I am not mistaken, there should not be a case where the leaf is null,
i.e., one should always be able to sort an instance to a leaf.
Even if the instance is sparse, the missing values of the sparse instance
are defined.
Do you know why that instance fails to be sorted to its correct leaf?

Cheers,


--
Gianmarco

On 15 September 2015 at 15:50, Davut Deniz Yavuz <
davut.deniz.yavuz@gmail.com> wrote:

> Hi,
>
> We have implemented a recursive method to dump the tree by implementing
> the describeSubTree methods in  ActiveLearningNode and SplitNode. (We
> checked the tree and it only consists of these types of nodes.) We print
> the tree by calling the describeSubTree function of the root of the tree.
>
> Unfortunately, we couldn’t decide where to call the
> treeRoot.describeSubTree function. Currently, we are calling it at the end
> of the attemptToSplit
> <https://github.com/apache/incubator-samoa/blob/9b178f63152e5b4c262e0f3ed28e77667832fc98/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L463> function.
> Therefore, the code prints a couple of trees and we are not sure if it
> prints the final form of the tree. Which place would be more appropriate to
> call this function to print just the final form of the tree?
>
> --
>
> We debugged the NullPointerException and we’ve found out that the source
> of the problem is in the filterInstanceToLeaf method in SplitNode class.
> The method returns new FoundNode(null, this, childIndex)
> <https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/SplitNode.java#L56>
> when the child is null. This method is called from ModelAggregatorProcessor
> by trainOnInstanceImpl(Instance inst)
> <https://github.com/apache/incubator-samoa/blob/9b178f63152e5b4c262e0f3ed28e77667832fc98/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L425>.
> Thus, a node returning null when the getNode function of it called is
> produced in the 394. line:
>
> FoundNode foundNode = this.treeRoot.filterInstanceToLeaf(inst, null, -1)
> <https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L394>.
>
>
> This node, then, is added to foundNodeSet by the line
> this.foundNodeSet.add(foundNode)
> <https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L452>.
> Because of this, while looping
> <https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L140>over
> the elements in the foundNodeSet, the getNode() function of the leafNode
> node returns a null node and it causes a NullPointerException.
>
> To solve the problem we checked how moa handles the null node, however, to
> our knowledge, there is not an alike structure like foundNodeSet in moa.
>
> --
>
> The KDD Cup 99 Dataset can be downloaded from here
> <https://drive.google.com/file/d/0B_XfRoyTW4lJWFF5TWttR0VLTnhCbERrcEtjY2thWGtLUlZv/view?usp=sharing>
> .
>
> Respectfully,
> Davut Deniz Yavuz, Ercan Ozturk
>
> 2015-09-11 9:57 GMT+03:00 Gianmarco De Francisci Morales <gd...@apache.org>
> :
>
>> Sure, the ticket is SAMOA-44
>> <https://issues.apache.org/jira/browse/SAMOA-44>.
>>
>> Arinto had started the work on model dumping, I don't know what's the
>> status there.
>> But it should be straightforward to implement a recursive method.
>>
>> If you could post the dataset somewhere where it is possible to download
>> it, it would be great.
>> If you want to take a stab at debugging what's going on and provide a
>> patch, it would be even better.
>>
>> Cheers,
>>
>> --
>> Gianmarco
>>
>> On 10 September 2015 at 08:49, Ercan Öztürk <e....@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Thank you very much for your quick response.
>>>
>>> We were using an older version of SAMOA. I've updated the code now (The
>>> last commit is currently "SAMOA-29: Excluding the samoa-storm.properties at
>>> compile time and including at test") and after building the code with "mvn
>>> package" the new command we use to run SAMOA is
>>>
>>> local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar
>>> "PrequentialEvaluation -i -1 -f 41920 -l
>>> (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s
>>> (org.apache.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)"
>>>
>>> The console output when the command is run:
>>>
>>> bin/samoa
>>> Deploying to LOCAL
>>> Command line string =  PrequentialEvaluation -i -1 -f 41920 -l
>>> (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s
>>> (org.apache.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)
>>> 2015-09-09 15:56:30,036 [main] INFO  org.apache.samoa.LocalDoTask
>>> (LocalDoTask.java:80) - Successfully instantiating
>>> org.apache.samoa.tasks.PrequentialEvaluation
>>> 2015-09-09 15:56:31,221 [main] INFO
>>>  org.apache.samoa.evaluation.EvaluatorProcessor
>>> (EvaluatorProcessor.java:83) - 1 seconds for 41920 instances
>>> 2015-09-09 15:56:31,227 [main] INFO
>>>  org.apache.samoa.evaluation.EvaluatorProcessor
>>> (EvaluatorProcessor.java:169) - evaluation instances = 41,920
>>> classified instances = 41,920
>>> classifications correct (percent) = 99.988
>>> Kappa Statistic (percent) = -0.002
>>> Kappa Temporal Statistic (percent) = 28.571
>>> Exception in thread "main" java.lang.NullPointerException
>>> at
>>> org.apache.samoa.learners.classifiers.trees.ModelAggregatorProcessor.process(ModelAggregatorProcessor.java:142)
>>> at
>>> org.apache.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
>>> at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:72)
>>> at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:61)
>>> at
>>> org.apache.samoa.learners.classifiers.trees.FilterProcessor.process(FilterProcessor.java:93)
>>> at
>>> org.apache.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
>>> at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:72)
>>> at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:61)
>>> at
>>> org.apache.samoa.topology.LocalEntranceProcessingItem.injectNextEvent(LocalEntranceProcessingItem.java:45)
>>> at
>>> org.apache.samoa.topology.LocalEntranceProcessingItem.startSendingEvents(LocalEntranceProcessingItem.java:63)
>>> at
>>> org.apache.samoa.topology.impl.SimpleTopology.run(SimpleTopology.java:44)
>>> at
>>> org.apache.samoa.topology.impl.SimpleEngine.submitTopology(SimpleEngine.java:33)
>>> at org.apache.samoa.LocalDoTask.main(LocalDoTask.java:88)
>>>
>>>
>>> We would be very appreciated if you could send us the link for the
>>> ticket so we can follow the updates on the issue.
>>>
>>> Yes, we would like to dump the model so that we can see the rules of the
>>> model and have a better understanding of it.
>>>
>>> The method body of describeSubtree() in Node.java is currently empty. Is
>>> there any work done on it that we can use as a starting point?
>>>
>>> If you need the data set to investigate the issue, I can send it via any
>>> suitable channel, please let me know.
>>>
>>> Respectfully,
>>> Ercan Ozturk
>>>

Re: HoeffdingTree and VerticalHoeffdingTree Classifiers run on KDD Cup 99 Data Set

Posted by Davut Deniz Yavuz <da...@gmail.com>.
Hi,

We have implemented a recursive method to dump the tree by implementing the
describeSubtree methods in ActiveLearningNode and SplitNode. (We checked
the tree, and it consists only of these two node types.) We print the tree
by calling describeSubtree on the root of the tree.

Unfortunately, we couldn’t decide where to call treeRoot.describeSubtree.
Currently, we call it at the end of the attemptToSplit
<https://github.com/apache/incubator-samoa/blob/9b178f63152e5b4c262e0f3ed28e77667832fc98/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L463>
function.
As a result, the code prints the tree several times, and we are not sure
whether the last tree printed is the final one. Where would be a more
appropriate place to call this function so that only the final tree is
printed?
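
For readers following along, a recursive dump of this kind can be sketched as
below. This is a minimal, self-contained illustration and not the SAMOA code:
the Node, LeafNode and SplitNode classes, their fields, the pad and dump
helpers, and the describeSubtree signature used here are all hypothetical
stand-ins chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a decision tree; these are NOT the SAMOA classes. */
public class TreeDumpSketch {

    abstract static class Node {
        /** Appends a text description of this node and its subtree to out. */
        abstract void describeSubtree(StringBuilder out, int indent);

        /** Hypothetical helper: two spaces per indentation level. */
        static void pad(StringBuilder out, int indent) {
            for (int i = 0; i < indent; i++) {
                out.append("  ");
            }
        }
    }

    /** Leaf: only reports the class it would predict. */
    static final class LeafNode extends Node {
        final String majorityClass;

        LeafNode(String majorityClass) {
            this.majorityClass = majorityClass;
        }

        @Override
        void describeSubtree(StringBuilder out, int indent) {
            pad(out, indent);
            out.append("Leaf: class = ").append(majorityClass).append('\n');
        }
    }

    /** Inner node: one labelled branch per child; a child may be missing. */
    static final class SplitNode extends Node {
        final String attribute;
        final List<String> branchLabels = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        SplitNode(String attribute) {
            this.attribute = attribute;
        }

        SplitNode addBranch(String label, Node child) {
            branchLabels.add(label);
            children.add(child);
            return this;
        }

        @Override
        void describeSubtree(StringBuilder out, int indent) {
            for (int i = 0; i < children.size(); i++) {
                pad(out, indent);
                out.append("if ").append(attribute).append(' ')
                   .append(branchLabels.get(i)).append(":\n");
                Node child = children.get(i);
                if (child == null) {
                    // A branch without a child is printed, not dereferenced.
                    pad(out, indent + 1);
                    out.append("(no child yet)\n");
                } else {
                    child.describeSubtree(out, indent + 1);
                }
            }
        }
    }

    /** Walks the tree from the root and returns the text dump. */
    static String dump(Node root) {
        StringBuilder out = new StringBuilder();
        root.describeSubtree(out, 0);
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = new SplitNode("protocol_type")
                .addBranch("= tcp", new LeafNode("normal"))
                .addBranch("= udp", null);
        System.out.print(dump(root));
    }
}
```

Building the dump in a StringBuilder and printing it once keeps the output of
one traversal together, which may help when the method is called from several
places while the tree is still growing.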

--

We debugged the NullPointerException and found that the source of the
problem is the filterInstanceToLeaf method in the SplitNode class. The
method returns new FoundNode(null, this, childIndex)
<https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/SplitNode.java#L56>
when the child is null. This method is called from ModelAggregatorProcessor
by trainOnInstanceImpl(Instance inst)
<https://github.com/apache/incubator-samoa/blob/9b178f63152e5b4c262e0f3ed28e77667832fc98/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L425>.
Thus, a FoundNode whose getNode() returns null is produced on line 394:

FoundNode foundNode = this.treeRoot.filterInstanceToLeaf(inst, null, -1)
<https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L394>.


This FoundNode is then added to foundNodeSet by the line
this.foundNodeSet.add(foundNode)
<https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L452>.
Because of this, while looping
<https://github.com/apache/incubator-samoa/blob/master/samoa-api/src/main/java/org/apache/samoa/learners/classifiers/trees/ModelAggregatorProcessor.java#L140>
over the elements of foundNodeSet, getNode() returns null for this
element, so leafNode is null and dereferencing it causes the
NullPointerException.
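
The failure path described above can be reproduced in miniature. The sketch
below is a simplified model and not the SAMOA source: the class shapes, the
use of an int branch index in place of an Instance, the seen counter, and the
process helper are assumptions made for illustration. Only the two behaviours
reported in this thread are mirrored: a split node hands back a FoundNode
wrapping null when the selected child is missing, and the workaround returns
false when the leaf is null.

```java
/** Miniature model of the null-leaf case; these are NOT the SAMOA classes. */
public class NullLeafSketch {

    /** A leaf-like node: filtering an instance to it yields the node itself. */
    static class Node {
        int seen = 0; // hypothetical counter standing in for leaf statistics

        FoundNode filterInstanceToLeaf(int branch, SplitNode parent, int parentBranch) {
            return new FoundNode(this, parent, parentBranch);
        }
    }

    /** A split node whose child slots may still be empty. */
    static final class SplitNode extends Node {
        final Node[] children;

        SplitNode(int numBranches) {
            this.children = new Node[numBranches];
        }

        @Override
        FoundNode filterInstanceToLeaf(int branch, SplitNode parent, int parentBranch) {
            Node child = children[branch];
            if (child != null) {
                return child.filterInstanceToLeaf(branch, this, branch);
            }
            // No child on this branch: the result wraps a null node.
            return new FoundNode(null, this, branch);
        }
    }

    /** Result of routing an instance down the tree. */
    static final class FoundNode {
        private final Node node;
        private final SplitNode parent;
        private final int parentBranch;

        FoundNode(Node node, SplitNode parent, int parentBranch) {
            this.node = node;
            this.parent = parent;
            this.parentBranch = parentBranch;
        }

        Node getNode() {
            return node;
        }
    }

    /** Mirrors the workaround from the thread: give up when there is no leaf. */
    static boolean process(FoundNode found) {
        Node leaf = found.getNode();
        if (leaf == null) {
            return false; // without this guard, leaf.seen++ would throw
        }
        leaf.seen++;
        return true;
    }

    public static void main(String[] args) {
        SplitNode root = new SplitNode(2);
        root.children[0] = new Node(); // branch 1 has no child yet

        System.out.println(process(root.filterInstanceToLeaf(0, null, -1)));
        System.out.println(process(root.filterInstanceToLeaf(1, null, -1)));
    }
}
```

The guard only hides the symptom: the instance routed to the empty branch is
silently dropped from training, which is consistent with the concern raised
earlier in the thread about a possible loss in accuracy.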

To solve the problem, we checked how MOA handles the null node; however, to
our knowledge, MOA has no structure analogous to foundNodeSet.

--

The KDD Cup 99 data set can be downloaded from here
<https://drive.google.com/file/d/0B_XfRoyTW4lJWFF5TWttR0VLTnhCbERrcEtjY2thWGtLUlZv/view?usp=sharing>.

Respectfully,
Davut Deniz Yavuz, Ercan Ozturk

2015-09-11 9:57 GMT+03:00 Gianmarco De Francisci Morales <gd...@apache.org>:

> Sure, the ticket is SAMOA-44
> <https://issues.apache.org/jira/browse/SAMOA-44>.
>
> Arinto had started the work on model dumping, I don't know what's the
> status there.
> But it should be straightforward to implement a recursive method.
>
> If you could post the dataset somewhere where it is possible to download
> it, it would be great.
> If you want to take a stab at debugging what's going on and provide a
> patch, it would be even better.
>
> Cheers,
>
> --
> Gianmarco
>

Re: HoeffdingTree and VerticalHoeffdingTree Classifiers run on KDD Cup 99 Data Set

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Sure, the ticket is SAMOA-44
<https://issues.apache.org/jira/browse/SAMOA-44>.

Arinto had started the work on model dumping; I don't know what the
status is there.
But it should be straightforward to implement a recursive method.

If you could post the dataset somewhere it can be downloaded from, that
would be great.
If you want to take a stab at debugging what's going on and provide a
patch, it would be even better.

Cheers,

--
Gianmarco

On 10 September 2015 at 08:49, Ercan Öztürk <e....@gmail.com> wrote:

> Hi,
>
> Thank you very much for your quick response.
>
> We were using an older version of SAMOA. I've updated the code now (The
> last commit is currently "SAMOA-29: Excluding the samoa-storm.properties at
> compile time and including at test") and after building the code with "mvn
> package" the new command we use to run SAMOA is
>
> local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar
> "PrequentialEvaluation -i -1 -f 41920 -l
> (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s
> (org.apache.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)"
>
> The console output when the command is run:
>
> bin/samoa
> Deploying to LOCAL
> Command line string =  PrequentialEvaluation -i -1 -f 41920 -l
> (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s
> (org.apache.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)
> 2015-09-09 15:56:30,036 [main] INFO  org.apache.samoa.LocalDoTask
> (LocalDoTask.java:80) - Successfully instantiating
> org.apache.samoa.tasks.PrequentialEvaluation
> 2015-09-09 15:56:31,221 [main] INFO
>  org.apache.samoa.evaluation.EvaluatorProcessor
> (EvaluatorProcessor.java:83) - 1 seconds for 41920 instances
> 2015-09-09 15:56:31,227 [main] INFO
>  org.apache.samoa.evaluation.EvaluatorProcessor
> (EvaluatorProcessor.java:169) - evaluation instances = 41,920
> classified instances = 41,920
> classifications correct (percent) = 99.988
> Kappa Statistic (percent) = -0.002
> Kappa Temporal Statistic (percent) = 28.571
> Exception in thread "main" java.lang.NullPointerException
> at
> org.apache.samoa.learners.classifiers.trees.ModelAggregatorProcessor.process(ModelAggregatorProcessor.java:142)
> at
> org.apache.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
> at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:72)
> at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:61)
> at
> org.apache.samoa.learners.classifiers.trees.FilterProcessor.process(FilterProcessor.java:93)
> at
> org.apache.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
> at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:72)
> at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:61)
> at
> org.apache.samoa.topology.LocalEntranceProcessingItem.injectNextEvent(LocalEntranceProcessingItem.java:45)
> at
> org.apache.samoa.topology.LocalEntranceProcessingItem.startSendingEvents(LocalEntranceProcessingItem.java:63)
> at
> org.apache.samoa.topology.impl.SimpleTopology.run(SimpleTopology.java:44)
> at
> org.apache.samoa.topology.impl.SimpleEngine.submitTopology(SimpleEngine.java:33)
> at org.apache.samoa.LocalDoTask.main(LocalDoTask.java:88)
>
>
> We would be very appreciated if you could send us the link for the ticket
> so we can follow the updates on the issue.
>
> Yes, we would like to dump the model so that we can see the rules of the
> model and have a better understanding of it.
>
> The method body of describeSubtree() in Node.java is currently empty. Is
> there any work done on it that we can use as a starting point?
>
> If you need the data set to investigate the issue, I can send it via any
> suitable channel, please let me know.
>
> Respectfully,
> Ercan Ozturk
>

Re: HoeffdingTree and VerticalHoeffdingTree Classifiers run on KDD Cup 99 Data Set

Posted by Ercan Öztürk <e....@gmail.com>.
Hi,

Thank you very much for your quick response.

We were using an older version of SAMOA. I've now updated the code (the
last commit is currently "SAMOA-29: Excluding the samoa-storm.properties at
compile time and including at test"). After building the code with "mvn
package", the new command we use to run SAMOA is

local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar
"PrequentialEvaluation -i -1 -f 41920 -l
(org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s
(org.apache.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)"

The console output when the command is run:

bin/samoa
Deploying to LOCAL
Command line string =  PrequentialEvaluation -i -1 -f 41920 -l (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s (org.apache.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)
2015-09-09 15:56:30,036 [main] INFO  org.apache.samoa.LocalDoTask (LocalDoTask.java:80) - Successfully instantiating org.apache.samoa.tasks.PrequentialEvaluation
2015-09-09 15:56:31,221 [main] INFO  org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:83) - 1 seconds for 41920 instances
2015-09-09 15:56:31,227 [main] INFO  org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:169) - evaluation instances = 41,920
classified instances = 41,920
classifications correct (percent) = 99.988
Kappa Statistic (percent) = -0.002
Kappa Temporal Statistic (percent) = 28.571
Exception in thread "main" java.lang.NullPointerException
    at org.apache.samoa.learners.classifiers.trees.ModelAggregatorProcessor.process(ModelAggregatorProcessor.java:142)
    at org.apache.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
    at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:72)
    at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:61)
    at org.apache.samoa.learners.classifiers.trees.FilterProcessor.process(FilterProcessor.java:93)
    at org.apache.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
    at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:72)
    at org.apache.samoa.topology.impl.SimpleStream.put(SimpleStream.java:61)
    at org.apache.samoa.topology.LocalEntranceProcessingItem.injectNextEvent(LocalEntranceProcessingItem.java:45)
    at org.apache.samoa.topology.LocalEntranceProcessingItem.startSendingEvents(LocalEntranceProcessingItem.java:63)
    at org.apache.samoa.topology.impl.SimpleTopology.run(SimpleTopology.java:44)
    at org.apache.samoa.topology.impl.SimpleEngine.submitTopology(SimpleEngine.java:33)
    at org.apache.samoa.LocalDoTask.main(LocalDoTask.java:88)


We would very much appreciate it if you could send us the link to the ticket
so that we can follow the updates on the issue.

Yes, we would like to dump the model so that we can see the rules of the
model and have a better understanding of it.

The method body of describeSubtree() in Node.java is currently empty. Is
there any work done on it that we can use as a starting point?

If you need the data set to investigate the issue, I can send it via any
suitable channel; please let me know.

Respectfully,
Ercan Ozturk

2015-09-09 15:11 GMT+03:00 Gianmarco De Francisci Morales <gd...@apache.org>:

> Hi,
>
> Thanks for reporting the bug.
> I'm not sure what is causing the issue.
> Are you using the master version of SAMOA?
> My line 145 of ModelAggregator is:
>               this.sendToAttributeStream(abce[i]);
>
> From what you say it seems that the problem is a bit above, and leafNode
> is null.
> However, by construction there should always be a leaf node.
>
> As a workaround your solution is fine, but I guess there is some other
> underlying problem with the code, which might cause some loss in accuracy.
> We should investigate this issue further, I'll open a ticket.
>
> Regarding fetching the content of the model, we had some prototype model
> dumper code (Arinto had started it), but I guess it's not working anymore.
> See the describeSubtree() method in Node.java.
> So unfortunately you need to do it yourself. However, the good thing is
> that the tree model is in a single place in ModelAggregator, so it should
> be relatively easy to walk the tree, starting from the root node.
> Do you want to dump the model to a text representation for human
> inspection?
>
> Cheers,
>
>
> --
> Gianmarco
>

Re: HoeffdingTree and VerticalHoeffdingTree Classifiers run on KDD Cup 99 Data Set

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Hi,

Thanks for reporting the bug.
I'm not sure what is causing the issue.
Are you using the master version of SAMOA?
My line 145 of ModelAggregator is:
              this.sendToAttributeStream(abce[i]);

From what you say it seems that the problem is a bit above, and leafNode is
null.
However, by construction there should always be a leaf node.

As a workaround your solution is fine, but I guess there is some other
underlying problem with the code, which might cause some loss in accuracy.
We should investigate this issue further, I'll open a ticket.
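
To make the possible loss in accuracy measurable, the null guard can count
the instances it drops instead of discarding them silently. The following is
only a minimal sketch of that idea. LeafStub, FoundNodeStub and
GuardedTrainer are hypothetical stand-ins written for illustration; they are
not the SAMOA classes, and the real process() method does more than this.

```java
// Hypothetical stand-in for a learning leaf; NOT the SAMOA ActiveLearningNode.
class LeafStub {
    int seen = 0;

    void learn() {
        seen++;
    }
}

// Hypothetical stand-in for the result of filtering an instance down the tree.
class FoundNodeStub {
    private final LeafStub node; // may be null if the branch has no leaf

    FoundNodeStub(LeafStub node) {
        this.node = node;
    }

    LeafStub getNode() {
        return node;
    }
}

// Illustrative trainer that applies the null guard and counts skipped instances.
class GuardedTrainer {
    private long skipped = 0;

    /** Returns true if the instance was used for training, false if it was skipped. */
    boolean train(FoundNodeStub found) {
        LeafStub leaf = found.getNode();
        if (leaf == null) {
            skipped++; // record the drop so that it can be reported later
            return false;
        }
        leaf.learn();
        return true;
    }

    long skippedCount() {
        return skipped;
    }
}
```

Comparing skippedCount() with the total number of instances gives a rough
idea of how much training data the workaround throws away.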

Regarding fetching the content of the model, we had some prototype model
dumper code (Arinto had started it), but I guess it's not working anymore.
See the describeSubtree() method in Node.java.
So unfortunately you need to do it yourself. However, the good thing is
that the tree model is in a single place in ModelAggregator, so it should
be relatively easy to walk the tree, starting from the root node.
Do you want to dump the model to a text representation for human inspection?
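
As a rough illustration of walking the tree from the root, here is a minimal
sketch of a recursive text dump. All types below (DumpNode, DumpLeaf,
DumpSplit, TreeDumper) are hypothetical stand-ins invented for this example;
the real SAMOA node classes and the describeSubtree() signature may differ,
so treat it as a starting point only.

```java
// Hypothetical stand-in node types; NOT the SAMOA Node/SplitNode classes.
abstract class DumpNode {
    abstract void describe(StringBuilder out, int depth);

    // Appends two spaces per nesting level.
    static void indent(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }
}

class DumpLeaf extends DumpNode {
    final String label;

    DumpLeaf(String label) {
        this.label = label;
    }

    @Override
    void describe(StringBuilder out, int depth) {
        indent(out, depth);
        out.append("leaf: ").append(label).append('\n');
    }
}

class DumpSplit extends DumpNode {
    final String attribute;
    final String[] branchValues;
    final DumpNode[] children; // an entry may be null if no leaf exists yet

    DumpSplit(String attribute, String[] branchValues, DumpNode[] children) {
        this.attribute = attribute;
        this.branchValues = branchValues;
        this.children = children;
    }

    @Override
    void describe(StringBuilder out, int depth) {
        for (int i = 0; i < children.length; i++) {
            indent(out, depth);
            out.append("if ").append(attribute).append(" = ")
               .append(branchValues[i]).append(":\n");
            if (children[i] == null) {
                indent(out, depth + 1);
                out.append("(no leaf yet)\n");
            } else {
                children[i].describe(out, depth + 1);
            }
        }
    }
}

class TreeDumper {
    /** Returns an indented, human-readable description of the tree. */
    static String dump(DumpNode root) {
        StringBuilder out = new StringBuilder();
        if (root != null) {
            root.describe(out, 0);
        }
        return out.toString();
    }
}
```

Calling TreeDumper.dump(root) on the root node returns one line per branch
and leaf, indented by depth, which is usually enough to read off the rules.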

Cheers,


--
Gianmarco

On 7 September 2015 at 18:23, Gianmarco De Francisci Morales <
gdfm@apache.org> wrote:

> Forwarding to the @dev list.
> --
> Gianmarco
>
> ---------- Forwarded message ----------
> From: Ercan Öztürk <e....@gmail.com>
> Date: 7 September 2015 at 16:57
> Subject: HoeffdingTree and VerticalHoeffdingTree Classifiers run on KDD
> Cup 99 Data Set
> To: gdfm@apache.org
>
>
> Hi Mr. Morales and Mr. Bifet,
>
> We are a couple of undergrad students from TOBB University. As a data
> mining class project, we decided to run HoeffdingTree classifier-in moa and
> VerticalHoeffdingTree classifier-in samoa on KDD Cup 99 data set (couldn't
> attach the data set to this mail due to the size limitations of the Apache
> mail server) and present the results in our project report.
>
> We were able to run HoeffdingTree Algorithm on the KDD Cup 99 (both on
> kddcup_full.arff, kddcup_10_percent.arff) data set. VerticalHoeffdingTree
> classifier also works fine on kddcup_10_percent.arff. However, when we
> try to run the VerticalHoeffdingTree classifier on kddcup_full.arff, we
> got the following error:
>
> The command we use to run SAMOA Local:
>
> bin/samoa local target/SAMOA-Local-0.3.0-SNAPSHOT.jar
> "PrequentialEvaluation -i -1 -f 41920 -l
> (com.yahoo.labs.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p
> 4) -s (com.yahoo.labs.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)"
>
> The console output of samoa:
>
> bin/samoa
>
> Deploying to LOCAL
>
> Command line string =  PrequentialEvaluation -i -1 -f 41920 -l
> (com.yahoo.labs.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p
> 4) -s (com.yahoo.labs.samoa.moa.streams.ArffFileStream -f kddcup_full.arff)
>
> 2015-09-01 22:22:16,160 [main] INFO  com.yahoo.labs.samoa.LocalDoTask
> (LocalDoTask.java:79) - Successfully instantiating
> com.yahoo.labs.samoa.tasks.PrequentialEvaluation
>
> 2015-09-01 22:22:17,741 [main] INFO
>  com.yahoo.labs.samoa.evaluation.EvaluatorProcessor
> (EvaluatorProcessor.java:86) - 1 seconds for 41920 instances
>
> 2015-09-01 22:22:17,760 [main] INFO
>  com.yahoo.labs.samoa.evaluation.EvaluatorProcessor
> (EvaluatorProcessor.java:172) - evaluation instances = 41,920
>
> classified instances = 41,920
>
> classifications correct (percent) = 99.988
>
> Kappa Statistic (percent) = -0.002
>
> Kappa Temporal Statistic (percent) = 28.571
>
> Exception in thread "main" java.lang.NullPointerException
> at com.yahoo.labs.samoa.learners.classifiers.trees.ModelAggregatorProcessor.process(ModelAggregatorProcessor.java:145)
> at com.yahoo.labs.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
> at com.yahoo.labs.samoa.topology.impl.SimpleStream.put(SimpleStream.java:71)
> at com.yahoo.labs.samoa.topology.impl.SimpleStream.put(SimpleStream.java:60)
> at com.yahoo.labs.samoa.learners.classifiers.trees.FilterProcessor.process(FilterProcessor.java:95)
> at com.yahoo.labs.samoa.topology.impl.SimpleProcessingItem.processEvent(SimpleProcessingItem.java:84)
> at com.yahoo.labs.samoa.topology.impl.SimpleStream.put(SimpleStream.java:71)
> at com.yahoo.labs.samoa.topology.impl.SimpleStream.put(SimpleStream.java:60)
> at com.yahoo.labs.samoa.topology.LocalEntranceProcessingItem.injectNextEvent(LocalEntranceProcessingItem.java:46)
> at com.yahoo.labs.samoa.topology.LocalEntranceProcessingItem.startSendingEvents(LocalEntranceProcessingItem.java:66)
> at com.yahoo.labs.samoa.topology.impl.SimpleTopology.run(SimpleTopology.java:42)
> at com.yahoo.labs.samoa.topology.impl.SimpleEngine.submitTopology(SimpleEngine.java:33)
> at com.yahoo.labs.samoa.LocalDoTask.main(LocalDoTask.java:87)
>
>
> We were able to track down the problem to the first instance that causes
> it; the instance is on the 76426th line in kddcup_full.arff. The instance
> is as follows:
>
>
> 1,tcp,smtp,SF,2252,331,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,7,0,0,0,0,1,0,1,5,216,1,0,0.2,0.01,0,0,0,0,normal
>
> We haven’t noticed any differences between the problematic instance and
> the other instances. Could you lead us to the root of the problem and could
> you help us on how to overcome this problem?
>
> As a workaround we’ve made the following addition to
> ModelAggregatorProcessor.java
>
> if (leafNode == null)
>         return false;
>
> after the line
>
> ActiveLearningNode leafNode = (ActiveLearningNode) foundNode.getNode();
>
> With this change, the VerticalHoeffdingTree classifier also works fine on
> kddcup_full.arff. Is this an acceptable solution to the problem?
>
>
> Besides, we were wondering how we could fetch model contents such as
> visiting nodes and node content etc.
>
> Thanks for your help,
>
>
> Respectfully,
>
> Ercan Ozturk, Davut Deniz Yavuz, Gozde Boztepe, Sezin Gurkan
>
>