Posted to user@spot.apache.org by Christos Minas Mathas <ma...@gmail.com> on 2018/01/23 16:35:48 UTC

Spot-ml parameters configuration

Hi,

I've been evaluating the Netflow component of Spot for quite some time 
now by running different kinds of attacks and collecting the results. I'm 
using the default configuration, I haven't changed any of the parameters, 
and the results I'm getting are not good. I was reading in the users 
mailing list some responses from Gustavo Lujan Moreno back in June 2017 
in which he said about the results they're getting:

"On proxy we are getting > 0.90 on AUC and on net flow > 0.99."

http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C1DD58ED7-BEE5-47E6-8886-537EE480E3E1%40intel.com%3E

My results in terms of AUROC are more like ~0.52 or worse.

He also gave some tips about configuring the parameters of spot-ml. So I 
thought I'd try them.

". . . "--ldamaxiterations 20” is the iteration parameter. You should 
change that 20 for something higher, at least 100, ideally +200.
. . .
If you are not getting good results the number of iterations and 
topics should be your priority."

http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C4F588C3D-B453-466F-BBCB-F7F1ABE7CC8D%40intel.com%3E

1. I changed ldamaxiterations to 200, but after running for ~60,000 
stages over two and a half hours one of the associated VMs ran out of 
RAM and ml_ops exited with a StackOverflowError. So I assigned 32 GB of 
RAM to each of the three associated VMs, and this time it stopped at 
~20,000 stages, again with a StackOverflow from another one of the VMs. 
How much RAM would I need for 200 iterations, and for which services?

2. Can someone explain how I can properly configure the parameters of 
spot-ml? For the topic count, for example, how can I calculate an 
approximate number of topics based on the traffic and the network setup?

If you need further information on my setup or the results I'm getting 
just let me know.

Thanks in advance


Re: Spot-ml parameters configuration

Posted by Christos Mathas <ma...@gmail.com>.
OK, I got some screenshots; if any more are needed, just tell me which.

The spot.conf file:

https://www.dropbox.com/s/uglbdqflivmtytk/spot.conf.png?dl=0

The hosts in the cluster and their roles:

https://www.dropbox.com/s/2dtkvpjmz9lyrs5/cloudera%201.png?dl=0

https://www.dropbox.com/s/4kx5u85jignfq8c/cloudera%202.png?dl=0

https://www.dropbox.com/s/xsqrb4gew0ujezl/cloudera%20manager.png?dl=0

I also found some errors that I think occurred after assigning more 
RAM to the hosts, but I can't try to resolve them right now because 
there is maintenance going on these days and I can't access the VMware 
ESXi interface. I got screenshots of these too:

https://www.dropbox.com/s/pmwwxtlrcro3jax/history%20server%20error.png?dl=0

https://www.dropbox.com/s/kmnn6oz04f57vr0/memory%20warning.png?dl=0




Re: Spot-ml parameters configuration

Posted by Ricardo Barona <ri...@gmail.com>.
Got it. Let's see what is in your configuration file and how your
cluster is configured, to get the most out of it.


Re: Spot-ml parameters configuration

Posted by Christos Minas Mathas <ma...@gmail.com>.
Hi Ricardo,

first of all, thank you for your answer. What you said on the first 
topic reminded me of something really important I forgot to mention:

I used the exact same Netflow data for the two executions of ml_ops. The 
only thing I changed was that I increased the amount of RAM as 
described. I was monitoring the VMs with htop during both executions, 
and the CPU/RAM behavior was totally different between the two, which is 
already obvious from the fact that the first one failed at ~60,000 
stages and the second one, with more RAM, failed at ~20,000 stages.

As you say, for now all I can do is keep playing with the parameters and 
see what happens. However, the reason I'm writing to the list is that 
I'm currently writing my undergraduate thesis on evaluating Apache Spot 
using penetration testing techniques, and I'm trying to get a better 
handle on how it works and hopefully get some better results out of it.




Re: Spot-ml parameters configuration

Posted by Ricardo Barona <ri...@gmail.com>.
Hi Christos,

Here are my thoughts about your questions.

1. From my experience working with memory in Apache Spark, and therefore in
Apache Spot, you need to know how your data is distributed, the size of
your files, and the number of files you are trying to process. Out-of-memory
errors can have many causes, but one of the most common is the number of
topics. With the default of 20, for each row in your data set (using NetFlow
data) you are going to add 4 vectors of 20 Doubles each: one vector for the
20 per-topic probabilities of the source IP, another for the destination IP,
one more for the source word, and another for the destination word. If you
are running DNS or Proxy the payload is half that, as we only analyse one IP
and one word (the word is composed from information in the same row).
Given that, you need to make sure your executors can fit that amount of
data, or reduce the number of topics. There is no concrete answer (sadly)
other than to experiment until it works.
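As a rough back-of-the-envelope check of that payload (a hedged sketch, not spot-ml code: it only counts the raw 8-byte doubles in the 4 per-row vectors and ignores JVM object overhead, so treat it as a lower bound):

```python
def lda_payload_gb(rows, num_topics, vectors_per_row=4):
    """Rough lower bound on the extra LDA data described above:
    vectors_per_row vectors of num_topics 8-byte doubles per NetFlow row.
    JVM object headers and boxing are ignored, so real usage is higher."""
    return rows * vectors_per_row * num_topics * 8 / 1024**3

# With the default 20 topics, 100 million NetFlow rows already add
# roughly 60 GB of vectors spread across the executors:
print(round(lda_payload_gb(100_000_000, 20), 1))  # 59.6
```

Halving the topic count halves this figure, which is one reason fewer topics can rescue an out-of-memory run.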

2. The number of topics is another discussion, often worthy of its own
papers/discussions in many forums, but from what I've seen in the past the
number of topics in Apache Spot can be as small as 5 and the results will be
similar to what you get with 20. I don't have documents to back that up, but
you can try it yourself and see if what I'm saying is correct. Again, the
number of topics, like the rest of the hyperparameter tuning, is something
you need to play with before you get the best solution.

Sorry I can't provide more information; as a Software Engineer, that's as
much as I can take from my past conversations with Data Scientists (like
Gustavo). If you create a list of parameters you want more information on,
please reply with it and I'll give you my input from what I have seen.

Thanks!



Re: Spot-ml parameters configuration

Posted by "Lujan Moreno, Gustavo" <gu...@intel.com>.
Hi,

The “word” column is the word created for that specific log; it is the input to the LDA. It is really not necessary for computing the AUROC; I left it in my code to get an idea of which words were being ranked high/low. Just discard it for now, as well as the host. You only need the score (probability) and the rank.

Best,

Gustavo


From: Christos Mathas <ma...@gmail.com>
Reply-To: "user@spot.incubator.apache.org" <us...@spot.incubator.apache.org>
Date: Wednesday, January 31, 2018 at 9:58 AM
To: "Lujan Moreno, Gustavo" <gu...@intel.com>, "user@spot.incubator.apache.org" <us...@spot.incubator.apache.org>
Subject: Re: Spot-ml parameters configuration


Hi Gustavo,

unfortunately I have zero knowledge of R programming, but I'll give it a try. So to use this script, I have to create an input file that contains only the attacks that were in flow_results.csv, and each entry should contain the columns host, score, rank, word? Also, what does "word" stand for?

Thank you

On 01/31/2018 05:12 PM, Lujan Moreno, Gustavo wrote:
Hi Christos,

If you are able to run 100 iterations it is fine, you can start analyzing your results. Try to reach 200 iterations as a second priority now.

I have doubts about the way you are computing the AUROC. You said that you are only using the first 100 values of the results; I assume they are ordered and ranked. It is not totally incorrect to use the first n values, but that is not the usual AUROC: you need to compute it over the whole dataset. To do this you don't have to know the rank of every single row, just the ranks of the attacks, which we assume are far fewer than the normal data.

I'm attaching an R script which computes the AUROC from the TP and FP rates, which are also computed in the script. You need to specify the size of your normal dataset plus the injected attacks. This is a very efficient way of computing the AUROC because you only use the ranks of the attacks. The script is designed to read several result files matching a given file-name pattern. The only requirement is an input file with the columns host, score, rank, word, although you can modify this. It also plots the ROC.

Give it a try. You may need to modify a couple of lines.

Best,

Gustavo



setwd("/Users/galujanm/Documents/R/Spark21_tuning/proxy2")

# Load needed libraries
library(pROC)
library(ggplot2)

# Helper to insert a row at a given location
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r+1, nrow(existingDF)+1), ] <- existingDF[seq(r, nrow(existingDF)), ]
  existingDF[r, ] <- newrow
  existingDF
}

totalRows <- 722992037        # Only normal data
totalRows <- totalRows + 200  # I’m injecting 200 attacks in this case

runs <- list.files(pattern = "spark21_proxy2_4_em*")

masterDF <- data.frame()

for (jj in 1:length(runs)) {

  # Read the lines of the file
  all_content = readLines(runs[jj])
  # Skip the first 8 lines; this could vary depending on the csv file
  skip = all_content[-c(1:8)]
  # Read the csv content
  df = read.csv(textConnection(skip), header = F, stringsAsFactors = FALSE)
  # Very important line: drop duplicate rows
  df <- df[!duplicated(df), ]
  # Assign the column names
  colnames(df) <- c('host', 'score', 'rank', 'word')
  # Order by rank
  df <- df[with(df, order(rank)), ]
  # Everything here is an attack
  df$label <- 'bad'
  # Value just for plotting
  df$y = 1
  # Data frame to be plotted
  toplot <- df[, c('rank', 'y')]
  # Add a little horizontal jitter for the plot
  toplot$rank <- toplot$rank + as.integer(rnorm(nrow(toplot), sd = 10))
  # Add a little vertical jitter
  toplot$y <- rnorm(nrow(toplot)) * .05
  # Create true positive and false positive columns
  df$TP <- -99
  df$FP <- -99
  # Insert a starting row
  df <- insertRow(df, c(1, 1, 0, 0, 'null', 1, -99, -99), 1)
  # Convert rank to numeric
  df$rank <- as.numeric(df$rank)
  # Order by rank (should already be ordered, but just in case)
  df <- df[order(df$rank), ]
  # Append a closing row
  df <- rbind(df, c(1, 1, 0, 0, 'null', 1, 1, 1))
  # Make sure everything is numeric
  df$rank <- as.numeric(df$rank)
  df$TP <- as.numeric(df$TP)
  df$FP <- as.numeric(df$FP)
  df$TP[1] <- 0
  df$FP[1] <- 0
  # The next lines compute the TP and FP rates
  for (i in 2:(nrow(df)-1)) {
    df$TP[i] <- (i-1) / (nrow(df)-2)
    # Formula corrected
    df$FP[i] <- (df$rank[i]-i+1) / (totalRows - nrow(df)-2)
  }
  # The next lines compute the AUC
  AUC <- 0
  for (i in 1:(nrow(df)-1)) {
    AUC <- AUC + (df$TP[i+1]) * (df$FP[i+1]-df$FP[i])
  }
  # Row index for each of the replicates
  df$y <- c(0:(nrow(df)-1))

  # Plot and save the ranks visualization
  ggplot(toplot, aes(x = rank, y = y)) +
    geom_point(pch = 21, position = position_jitter(width = 1)) +
    ylim(-0.5, .5) + xlim(-round(totalRows*.1, 0), totalRows) +
    ggtitle("Rank for anomalies")
  # ggsave('rank.png', width = 15, height = 10, units = 'cm')

  # Plot the ROC curve for the last replicate
  ggplot(df, aes(FP, TP)) + geom_line() +
    xlab("FP (1-specificity)") + ylab("TP (sensitivity)") +
    ggtitle('ROC-AUC Proxy') +
    geom_abline(slope = 1, intercept = 0)

  print(runs[jj])
  print(AUC)
}
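The rank-based idea in the R script above can be condensed into a few lines. A hedged Python sketch (the names are mine, and it is not part of the original attachment), using the identity that AUROC equals the fraction of correctly ordered (attack, normal) pairs:

```python
def auroc_from_attack_ranks(attack_ranks, total_rows):
    """AUROC computed from only the ranks of the injected attacks.

    attack_ranks: 1-based positions of the attacks in the full results,
                  sorted by score (rank 1 = most suspicious).
    total_rows:   normal rows + attack rows scored in the run.
    """
    k = len(attack_ranks)
    normals = total_rows - k
    ranks = sorted(attack_ranks)
    # Normal rows ranked above attack i = its rank minus the attacks above it;
    # AUROC is 1 minus the fraction of (attack, normal) pairs ordered wrongly.
    wrong_pairs = sum(r - (i + 1) for i, r in enumerate(ranks))
    return 1.0 - wrong_pairs / (k * normals)

# 200 injected attacks all in the top 200 of the results -> perfect score:
print(auroc_from_attack_ranks(range(1, 201), 723_000_000))  # 1.0
```

Note this is the score over the whole dataset, as Gustavo describes, not over the first 100 rows.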






From: Christos Mathas <ma...@gmail.com>
Reply-To: "user@spot.incubator.apache.org"<ma...@spot.incubator.apache.org> <us...@spot.incubator.apache.org>
Date: Wednesday, January 31, 2018 at 3:33 AM
To: "user@spot.incubator.apache.org"<ma...@spot.incubator.apache.org> <us...@spot.incubator.apache.org>
Subject: Re: Spot-ml parameters configuration


Hi,

I tried increasing the number of max iterations in steps: 50, 70, 100, 150, and 200. It worked for every value except 200, where I got this:

[Stage 37393:==================================================>(198 + 2) / 200]Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError

and the whole output of ml_ops.sh:

https://www.dropbox.com/s/maq0efs8c0xb8ga/ml.out?dl=0
I should also note that the last time it failed with the iterations set to 200, the error was about a task failing because YARN lost an executor, but again due to java.lang.StackOverflowError. The only thing I have changed between the two executions is this: export JAVA_TOOL_OPTIONS="-Xmx16G", which I set in /etc/profile.d/ for everything on the system, to check whether the default Java memory value was causing the problem.

So I'm still stuck on fixing whatever is causing this. I don't have any more leads as to which configuration I should examine. When I do, I will also check the topics parameter as you suggested.

On your other questions:
"How large is your dataset?"
As an example, the dataset I used for the ml_ops executions with the different iteration values described above is ~1 MB. Here is a screenshot from HDFS:

https://www.dropbox.com/s/c6hp937zvl77s3p/hdfs_hive.png?dl=0

"How many attacks.." "How are you generating.."
I have a Kali VM inside the network from which I deploy the attacks in real time while the traffic is generated. I do one attack at a time. For this particular dataset I used Armitage to do an nmap scan and a Hail Mary attack, which is a really "noisy", unsophisticated attack.

"How are you computing AUROC?"
I am computing the AUROC with MS Excel, taking as input the first 100 rows of flow_results.csv. I have uploaded a file to Dropbox to make it clearer exactly how I'm doing it:

https://www.dropbox.com/s/7k6erupp5jbpnpm/Hail%20Mary_9_ROC.xlsx?dl=0

Thank you
On 01/29/2018 04:49 PM, Lujan Moreno, Gustavo wrote:
Hi,

The number of iterations at 100 should be a priority. If you are not able to run them, you have a technical problem with your cluster and you should fix it first. Once you are able to run 100 iterations, start playing around with the number of topics: start with 5 topics, run 10 replicates, save the results (AUROC), change to 10 topics, do the same, then go to 20, 50, 100, etc. Once you have the results, plot them (x-axis: number of topics, y-axis: AUROC) and you will be able to see the pattern or trend, as well as the variation and central tendency across the replicates. Visually, it should be clear which number of topics works best.

Finally, to make this statistically sound, run a pair-wise comparison (Tukey's test, for example) where the number of topics is your main factor. This statistical analysis is just to establish the significance of the results. For example, if you visually see that 10 topics looks better than 5 but the test says there is no statistical difference, then there is no point in running 10 topics, because it is more computationally expensive; you might as well run with numTopics = 5.
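The replicate bookkeeping for such a sweep can be sketched as below (hedged: the numbers are made up, `pick_topic_count` is my own name, and the one-standard-deviation rule is only a crude stand-in for a real Tukey HSD from a stats package):

```python
from statistics import mean, stdev

def pick_topic_count(results):
    """results: {num_topics: [AUROC per replicate]} from repeated ml_ops runs.

    Returns the smallest topic count whose mean AUROC is within one standard
    deviation of the best mean -- i.e. if a larger topic count is not clearly
    better, prefer the computationally cheaper run.
    """
    table = {k: (mean(v), stdev(v)) for k, v in sorted(results.items())}
    best_mean, best_sd = max(table.values(), key=lambda ms: ms[0])
    for k, (m, _) in table.items():  # ascending by topic count
        if m >= best_mean - best_sd:
            return k

# Made-up replicate results for illustration only:
sweep = {5: [0.90, 0.91, 0.89], 10: [0.91, 0.92, 0.90], 20: [0.91, 0.90, 0.92]}
print(pick_topic_count(sweep))  # 5 -- 10 and 20 topics are not clearly better
```

A proper pairwise significance test on the same `results` dict would replace the one-standard-deviation shortcut.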

Other questions to consider:

How large is your dataset?
How many attacks are you injecting?
How are you generating the attacks?
How are you computing the AUROC?

An AUROC of 0.52 basically tells you that you are finding nothing but randomness.

Best,

Gustavo


From: Christos Minas Mathas <ma...@gmail.com>
Reply-To: "user@spot.incubator.apache.org"<ma...@spot.incubator.apache.org> <us...@spot.incubator.apache.org>
Date: Tuesday, January 23, 2018 at 10:36 AM
To: "user@spot.incubator.apache.org"<ma...@spot.incubator.apache.org> <us...@spot.incubator.apache.org>
Subject: Spot-ml parameters configuration


Hi,

I've been evaluating the Netflow component of Spot for quite some time now by using different kinds of attacks and collect the results. I'm using the default configuration, I haven't changed any of the parameters and the results I'm getting are not good. I was reading in the users mailing list some responses from Gustavo Lujan Moreno back in June 2017 in which he said about the results they're getting:

"On proxy we are getting > 0.90 on AUC and on net flow >0.99."

http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C1DD58ED7-BEE5-47E6-8886-537EE480E3E1%40intel.com%3E

My results in terms of AUROC are more like ~0.52 or worse.

He also gave some tips about configuring the parameters of spot-ml. So I thought I'd try them.

". . ."--ldamaxiterations 20” is the iteration parameter. You should change that 20 for something higher, at least 100, ideally +200.
. . .
If you are not getting good results the number of iterations and topics should be your priority."

http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C4F588C3D-B453-466F-BBCB-F7F1ABE7CC8D%40intel.com%3E

1. I changed ldamaxiterations to 200 but after running for ~60000 stages and 2 and a half hours there wasn't enough RAM in one of the associated VMs and ml_ops exited with a StackOverflowException. So I assigned 32GB of RAM to each one of the three VMs associated and this time it stopped at ~20000 stages again with a StackOverflow from another one of the associated VMs. How much RAM would I need for 200 iterations and for which services?

2. Can someone explain how can I properly configure the parameters of spot-ml? Like for the topic count for example, how can I calculate an approximate value of topics based on the traffic and the network setup?

If you need further information on my setup or the results I'm getting just let me know.

Thanks in advance






Re: Spot-ml parameters configuration

Posted by Christos Mathas <ma...@gmail.com>.
Hi Gustavo,

unfortunately I have zero knowledge of R programming, but I'll give it a 
try. So to use this script, I have to create an input file which 
contains only the attacks that were in flow_results.csv, and each 
entry should contain the columns host, score, rank, word? Also, what 
does "word" stand for?

Thank you
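[Editor's note: one way to build the input the script asks about is to filter flow_results.csv down to the attack rows. A minimal sketch in Python; the column layout of flow_results.csv, the attacker IP, and the use of the line position as the rank are all assumptions to adapt, and "attack" is just a placeholder for the word column:]

```python
import csv

# Hypothetical attacker IP of the Kali VM; adjust to your environment.
ATTACK_IPS = {"192.168.1.66"}

def extract_attack_rows(results_path, out_path):
    """Keep only rows whose host is a known attacker, emitting the
    host, score, rank, word columns the R script expects. The rank is
    taken as the row's position, assuming the file is already sorted
    by suspiciousness (an assumption about flow_results.csv)."""
    with open(results_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for rank, row in enumerate(reader, start=1):
            host, score = row[0], row[1]
            if host in ATTACK_IPS:
                writer.writerow([host, score, rank, "attack"])
```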



Re: Spot-ml parameters configuration

Posted by "Lujan Moreno, Gustavo" <gu...@intel.com>.
Hi Christos,

If you are able to run 100 iterations, that is fine; you can start analyzing your results. Try to reach 200 iterations as a second priority now.

I have doubts about the way you are computing the AUROC. You said that you are only using the first 100 values of the results. I assume they are ordered and ranked. It is not totally incorrect to use the first n values, but that is not the usual AUROC: you need to compute it over the whole dataset. To do this you don’t need to know the rank of every single row, just the ranks of the attacks, which we assume are far fewer than the normal data. I’m attaching an R script which computes the AUROC from the TP and FP rates, which are also computed in the script. You need to specify the size of your normal dataset plus attacks. This is a very efficient way of computing the AUROC because you only use the ranks of the attacks. The script is designed to read several results files with a given file-name pattern. The only requirement is an input file with columns host, score, rank, word, although you can modify this. It also plots the ROC.

Give it a try. You may need to modify a couple of lines.
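[Editor's note: the rank-only idea above can be stated compactly. A sketch in Python (not the attached script itself): with rank 1 = most suspicious, AUROC is the fraction of (attack, normal) pairs in which the attack outranks the normal row, which needs only the attack ranks and the total row count:]

```python
def auroc_from_attack_ranks(attack_ranks, total_rows):
    """AUROC computed only from the ranks of the attacks (rank 1 =
    most suspicious), i.e. the normalized Mann-Whitney U statistic."""
    ranks = sorted(attack_ranks)
    n_attacks = len(ranks)
    n_normal = total_rows - n_attacks
    # The i-th best attack (1-based) is outranked by (rank - i) normal
    # rows, so it "beats" the remaining n_normal - (rank - i) of them.
    wins = sum(n_normal - (r - i) for i, r in enumerate(ranks, start=1))
    return wins / (n_attacks * n_normal)
```

Perfect detection (all attacks at the top) gives 1.0 and the worst case (all attacks at the bottom) gives 0.0, which is a quick sanity check on any spreadsheet computation.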

Best,

Gustavo



setwd("/Users/galujanm/Documents/R/Spark21_tuning/proxy2")

#Load the needed libraries
library(pROC)
library(ggplot2)

#Function to insert a row at a given location
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r+1, nrow(existingDF)+1), ] <- existingDF[seq(r, nrow(existingDF)), ]
  existingDF[r, ] <- newrow
  existingDF
}

totalRows <- 722992037       #Only normal data
totalRows <- totalRows + 200 #I’m injecting 200 attacks in this case

#NB: the pattern argument of list.files() is a regular expression, not a shell glob
runs <- list.files(pattern = "spark21_proxy2_4_em*")

masterDF <- data.frame()

for (jj in 1:length(runs)) {

  #Read the lines of the file
  all_content <- readLines(runs[jj])
  #Skip the first 8 header lines; this could vary depending on the csv file
  skip <- all_content[-c(1:8)]
  #Read the csv content
  df <- read.csv(textConnection(skip), header = FALSE, stringsAsFactors = FALSE)
  #Very important line: drop duplicated rows
  df <- df[!duplicated(df), ]
  #Assign the column names
  colnames(df) <- c('host', 'score', 'rank', 'word')
  #Order by rank
  df <- df[with(df, order(rank)), ]
  #Everything is an attack here
  df$label <- 'bad'
  #Value just for plotting
  df$y <- 1

  #Data frame to be plotted
  toplot <- df[, c('rank', 'y')]
  #Add a little horizontal jitter for the plot
  toplot$rank <- toplot$rank + as.integer(rnorm(nrow(toplot), sd = 10))
  #Add a little vertical jitter
  toplot$y <- rnorm(nrow(toplot)) * .05

  #Create true-positive and false-positive columns
  df$TP <- -99
  df$FP <- -99
  #Insert a starting row
  df <- insertRow(df, c(1, 1, 0, 0, 'null', 1, -99, -99), 1)
  #Convert rank to numeric
  df$rank <- as.numeric(df$rank)
  #Order by rank (it should already be ordered, but just in case)
  df <- df[order(df$rank), ]
  #Append a closing row
  df <- rbind(df, c(1, 1, 0, 0, 'null', 1, 1, 1))
  #Make sure everything is numeric
  df$rank <- as.numeric(df$rank)
  df$TP <- as.numeric(df$TP)
  df$FP <- as.numeric(df$FP)
  df$TP[1] <- 0
  df$FP[1] <- 0

  #The next lines compute the TP and FP rates
  for (i in 2:(nrow(df)-1)) {
    df$TP[i] <- (i-1)/(nrow(df)-2)
    #Formula corrected
    df$FP[i] <- (df$rank[i]-i+1)/(totalRows - nrow(df)-2)
  }

  #The next lines compute the AUC
  AUC <- 0
  for (i in 1:(nrow(df)-1)) {
    AUC <- AUC + df$TP[i+1]*(df$FP[i+1]-df$FP[i])
  }

  #Replace y with a simple 0-based index for this replicate
  df$y <- c(0:(nrow(df)-1))

  #Plot and save the rank visualization
  ggplot(toplot, aes(x = rank, y = y)) +
    geom_point(pch = 21, position = position_jitter(width = 1)) +
    ylim(-0.5, .5) + xlim(-round(totalRows*.1, 0), totalRows) +
    ggtitle("Rank for anomalies")
  #ggsave('rank.png', width = 15, height = 10, units = 'cm')

  #Plot the ROC curve of the last replicate
  ggplot(df, aes(FP, TP)) + geom_line() +
    xlab("FP (1-specificity)") + ylab("TP (sensitivity)") +
    ggtitle('ROC-AUC Proxy') +
    geom_abline(slope = 1, intercept = 0)

  print(runs[jj])
  print(AUC)
}









Re: Spot-ml parameters configuration

Posted by Christos Mathas <ma...@gmail.com>.
Hi,

I tried increasing the number of max iterations in steps: 50, 70, 100, 
150 and 200. It worked for all values except 200, where I got this:

[Stage 37393:==================================================>(198 + 
2) / 200]Exception in thread "main" org.apache.spark.SparkException: Job 
aborted due to stage failure: Task serialization failed: 
java.lang.StackOverflowError

and the whole output of ml_ops.sh:

https://www.dropbox.com/s/maq0efs8c0xb8ga/ml.out?dl=0

I should also note that the last time it failed with the iterations set 
at 200, the error was about a task failing because YARN lost an 
executor, but again due to java.lang.StackOverflowError. The only thing 
I changed between the two executions is export 
JAVA_TOOL_OPTIONS="-Xmx16G", which I set in /etc/profile.d/ 
system-wide, to check whether the default Java heap size was causing 
the problem.

So I'm still stuck on fixing whatever is causing this, and I don't have 
any more leads as to which configuration I should examine. When I do, I 
will also check the topics parameter as you suggested.
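[Editor's note: a java.lang.StackOverflowError in a long iterative Spark job is typically caused by serializing a very deep RDD lineage, not by heap exhaustion, so raising -Xmx alone often does not help. Two common mitigations, sketched here as assumptions to adapt rather than spot-specific advice: a larger JVM thread stack via -Xss, and checkpointing to truncate the lineage (Spark MLlib's LDA exposes setCheckpointInterval, which requires a checkpoint directory set through SparkContext.setCheckpointDir). The extra flags would go on the spark-submit call that ml_ops.sh builds:]

```shell
# Sketch: give driver and executors a bigger thread stack so deep
# lineage serialization at high --ldamaxiterations does not overflow.
spark-submit \
  --conf spark.driver.extraJavaOptions="-Xss64m" \
  --conf spark.executor.extraJavaOptions="-Xss64m" \
  ...
```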

On your other questions:
"How large is your dataset?"
As an example, the dataset I used for the ml_ops runs with the 
different iteration values described above is ~1 MB. Here is a 
screenshot from HDFS:

https://www.dropbox.com/s/c6hp937zvl77s3p/hdfs_hive.png?dl=0

"How many attacks.." "How are you generating.."
I have a Kali VM inside the network from which I deploy the attacks in 
real time while the traffic is generated. I do one attack at a time. For 
this particular dataset I used Armitage to do an nmap scan and a Hail 
Mary attack, which is a very "noisy", unsophisticated attack.

"How are you computing AUROC?"
I compute AUROC in MS Excel, using the first 100 rows of 
flow_results.csv as input. I have uploaded a file to Dropbox that shows 
exactly how I'm doing it.

https://www.dropbox.com/s/7k6erupp5jbpnpm/Hail%20Mary_9_ROC.xlsx?dl=0

Thank you



Re: Spot-ml parameters configuration

Posted by "Lujan Moreno, Gustavo" <gu...@intel.com>.
Hi,

The number of iterations at 100 should be a priority. If you are not able to run them, you have a technical problem with your cluster and you should fix it first. Once you are able to run 100 iterations, you should start playing around with the number of topics. Start with 5 topics, run 10 replicates, save the results (AUROC); change to 10 topics, do the same; then go to 20, 50, 100, etc. Once you have the results, plot them (x axis: number of topics; y axis: AUROC) and you will be able to see the pattern or trend, as well as the variation and central tendency of the replicates at each setting. Visually, it should be clear which number of topics works best. Finally, to make this statistically sound, run a pair-wise comparison (Tukey’s test, for example) where the number of topics is your main factor. This statistical analysis is just to prove the significance of the results. For example, if you visually see that 10 topics is better than 5 but the test says there is no statistical difference, then there is no point in running 10 topics because it is more computationally expensive; you might as well run it at numTopic = 5.

Other questions to consider:

How large is your dataset?
How many attacks are you injecting?
How are you generating the attacks?
How are you computing the AUROC?

An AUROC of 0.52 basically tells you that you are finding nothing but randomness.

Best,

Gustavo
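[Editor's note: the sweep described above is easy to script. A sketch in Python, where run_auroc is a hypothetical hook standing in for "launch ml_ops.sh with a given topic count and compute the AUROC of the run":]

```python
from statistics import mean, stdev

def sweep(topic_counts, replicates, run_auroc):
    """Collect `replicates` AUROC values per topic count.
    run_auroc(n) is a stand-in for running spot-ml with n topics
    and measuring AUROC (hypothetical hook)."""
    return {n: [run_auroc(n) for _ in range(replicates)]
            for n in topic_counts}

def summarize(results):
    """Mean and spread per topic count, best first: the raw material
    for the topics-vs-AUROC plot and for a pairwise test such as
    Tukey's HSD before trusting small differences."""
    rows = [(n, mean(v), stdev(v)) for n, v in results.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

A replicate spread (stdev) comparable to the gap between two topic counts is the visual cue that a formal significance test is worth running.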

