Posted to user@spot.apache.org by Christos Mathas <ma...@gmail.com> on 2018/01/19 09:30:33 UTC

ml_ops.sh fails with NumberFormatException when reading flow_scores.csv

Hi,

I'm running ml_ops.sh and I have scored previous results, so ml tries to 
read the data from flow_scores.csv. It fails in stage 2 and the output 
is this:


[Stage 2:>                                                          (0 + 2) / 4]18/01/19 11:13:57 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 2.0 (TID 5, cloudera-host-2.shield.com, executor 1): java.lang.NumberFormatException: For input string: "0,2018-01-18 09:35:42,193.93.167.241,10.101.30.60,123,123,UDP,2,152,0,0,3.0071374283430035E-5,56,,,,,,,,"
     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
     at java.lang.Integer.parseInt(Integer.java:492)
     at java.lang.Integer.parseInt(Integer.java:527)
     at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
     at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
     at org.apache.spot.netflow.model.FlowFeedback$$anonfun$loadFeedbackDF$2.apply(FlowFeedback.scala:85)
     at org.apache.spot.netflow.model.FlowFeedback$$anonfun$loadFeedbackDF$2.apply(FlowFeedback.scala:85)
     at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
     at org.apache.spark.scheduler.Task.run(Task.scala:89)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)

.

.

.

As you can see, the problem is that it attempts to parse the whole line; 
it hasn't split it. My understanding is that the file responsible for 
parsing the csv is FlowFeedback.scala 
(https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowFeedback.scala). 
I saw in the code that it splits the data by "\t", so I checked 
flow_scores.csv and found that it is comma (",") separated, not "\t". 
I tried replacing "\t" with ",", but I got the exact same error. I 
don't know Scala, so I'm asking for your help on how I could fix this.

Thank you in advance
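For anyone hitting the same trace: the failure mode can be reproduced in a few lines of standalone Scala. This is a minimal sketch, not the actual spot-ml code — it only shows that splitting a comma-separated line on "\t" returns the whole line as a single field, which then fails in toInt with exactly the shape seen in the log above.

```scala
// Minimal sketch (not the spot-ml source): why a comma-separated line
// fed to a tab-based splitter ends in NumberFormatException.
object DelimiterDemo {
  def main(args: Array[String]): Unit = {
    val line = "0,2018-01-18 09:35:42,193.93.167.241,10.101.30.60,123"

    // The line contains no tabs, so split("\t") yields one element:
    // the entire line.
    val fields = line.split("\t")
    assert(fields.length == 1)

    // Calling toInt on that single "field" throws, reporting the whole
    // line as the offending input string -- matching the stack trace.
    try {
      fields(0).toInt
      assert(false, "expected a NumberFormatException")
    } catch {
      case _: NumberFormatException =>
        println("NumberFormatException for the whole line, as expected")
    }
  }
}
```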


Re: ml_ops.sh fails with NumberFormatException when reading flow_scores.csv

Posted by Christos Mathas <ma...@gmail.com>.
I accidentally replied only to Curtis and not to the list, so I'm 
replying to the list now that the problem has been resolved:

Actually I have an older version of Apache Spot, so the code I'm running 
differs quite a bit from the one on GitHub. I made changes to 
FlowFeedback.scala and was able to parse the file correctly. Thank you 
for your time.



Re: ml_ops.sh fails with NumberFormatException when reading flow_scores.csv

Posted by Ricardo Barona <ri...@gmail.com>.
Hi Christos,

Curtis is absolutely right: what you need to pass is feedback. This is the
only part of the process closely tied to spot-oa. After scoring with
spot-ml, spot-oa will show the top N connections least likely to occur;
security experts should then determine whether each is actually an attack
or a false positive. After that, the feedback will be saved in the location
mentioned by Curtis.

I can share the fields and format of a feedback file if you just want to
“recreate” the flow.

Let me know.


Re: ml_ops.sh fails with NumberFormatException when reading flow_scores.csv

Posted by Curtis Howard <cu...@cloudera.com>.
Hi Christos,

Your application seems to be using netflow *results* rather than a
*feedback* file.  As you mention, the feedback file uses a "\t" delimiter,
and the following schema:
https://github.com/apache/incubator-spot/blob/ab11e8c8a00b137aafff60c85cadc5edb8150020/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowFeedback.scala#L62

By default, ml_ops.sh looks for the feedback file at the following HDFS
path ($HPATH defined in /etc/spot.conf):
${HPATH}/feedback/ml_feedback.csv
relevant code:  https://github.com/apache/incubator-spot/blob/ab11e8c8a00b137aafff60c85cadc5edb8150020/spot-ml/ml_ops.sh#L97

In addition to this user mailing list, there's also a Spot channel on
Slack, which you can use to ask questions:  http://slack.apache-spot.io/

Hope this helps

Curtis
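As a rough illustration of the distinction Curtis draws, here is a hedged Scala sketch of parsing one tab-delimited feedback line. The record fields and column positions below are hypothetical, chosen only for illustration — the real column order is defined in the FlowFeedback.scala schema linked above.

```scala
// Hedged sketch: parsing a tab-delimited feedback line. The column
// layout here is hypothetical; the actual schema lives in
// FlowFeedback.scala (linked above).
object FeedbackParseSketch {
  // Illustrative record -- NOT the actual spot-ml feedback schema.
  case class FeedbackRow(srcIp: String, dstIp: String,
                         srcPort: Int, dstPort: Int, score: Int)

  def parseLine(line: String): FeedbackRow = {
    // Feedback files are tab-delimited, unlike the comma-separated
    // flow_scores.csv results file that triggered the original error.
    val f = line.split("\t")
    FeedbackRow(f(0), f(1), f(2).toInt, f(3).toInt, f(4).toInt)
  }

  def main(args: Array[String]): Unit = {
    println(parseLine("10.0.0.1\t10.0.0.2\t123\t53\t1"))
  }
}
```

The point of the sketch: with the correct delimiter, each numeric column parses on its own, instead of toInt receiving the entire line.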
