Posted to dev@samoa.apache.org by Bertsimas Ilias <aw...@gmail.com> on 2015/05/13 15:03:35 UTC

SAMOA getting started help

Hi all!

I am in the process of running some tests for online machine learning on
data streams from social media. I came across Apache SAMOA and it seemed like
a very interesting framework.
However, I could not figure out how to get it to train and test
using a sparse array of tf-idf feature vectors. I provide the data in the
standard WEKA ARFF format, and although it runs, the output is something
along the lines of:

2015-05-12 22:58:58,993 [main] INFO com.yahoo.labs.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:189) - com.yahoo.labs.samoa.evaluation.EvaluatorProcessorid = 0
evaluation instances,classified instances,classifications correct (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent)
100.0,100.0,100.0,100.0,?
200.0,200.0,100.0,100.0,?
300.0,300.0,100.0,100.0,?
400.0,400.0,100.0,100.0,?
500.0,500.0,100.0,100.0,?
600.0,600.0,100.0,100.0,?
700.0,700.0,100.0,100.0,?
800.0,800.0,100.0,100.0,?
900.0,900.0,100.0,100.0,?
1000.0,1000.0,100.0,100.0,?
1100.0,1100.0,100.0,100.0,?
1200.0,1200.0,100.0,100.0,?
1300.0,1300.0,100.0,100.0,?
1400.0,1400.0,100.0,100.0,?
1500.0,1500.0,100.0,100.0,?
1600.0,1600.0,100.0,100.0,?
1700.0,1700.0,100.0,100.0,?
1800.0,1800.0,100.0,100.0,?
1900.0,1900.0,100.0,100.0,?



I have read the documentation on the SAMOA project page, but I wasn't able
to figure out how to get classification results per instance.
Could you please point me in the right direction in terms of the
formats SAMOA accepts as stream input? Does the data need to include a
labeled training set?

Any examples you could provide me with that are not already in the
documentation would be most welcome!


Kind Regards,

Ilias Bertsimas.

Re: SAMOA getting started help

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.

Yes, by default the last attribute is used as a class.

In theory you just put them both (train and test) in the same file.
In a stream, you would get a mix of labeled and unlabeled instances
continuously.
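
For illustration, here is a minimal Python sketch of how such a mixed file could be written in sparse ARFF, with the class as the last attribute. The attribute names, the `sparse_arff_row` helper and the tf-idf weights are made up for this example. The `?` for an unknown label follows the WEKA ARFF convention; whether SAMOA's reader accepts it in a sparse row is an assumption worth checking on a small file first.

```python
def sparse_arff_row(weights, class_index, label=None):
    """Format one sparse ARFF data row (hypothetical helper).

    weights: dict mapping zero-based attribute index -> non-zero tf-idf weight
    class_index: index of the class attribute (the last attribute here)
    label: nominal class value, or None for an unlabeled instance
    """
    cells = [f"{i} {w:g}" for i, w in sorted(weights.items())]
    # Always write the class explicitly: in sparse ARFF an omitted value
    # means 0 (the first nominal value), not "missing". "?" marks unknown.
    cells.append(f"{class_index} {label if label is not None else '?'}")
    return "{" + ", ".join(cells) + "}"


# Illustrative header: three tf-idf features plus the class as last attribute.
header = """@relation tweets
@attribute tfidf_good numeric
@attribute tfidf_bad numeric
@attribute tfidf_great numeric
@attribute sentiment {neg,pos}
@data"""

rows = [
    sparse_arff_row({0: 0.7, 2: 0.3}, class_index=3, label="pos"),
    sparse_arff_row({1: 0.9}, class_index=3, label="neg"),
    sparse_arff_row({0: 0.5}, class_index=3),  # unlabeled: prediction only
]

print(header)
print("\n".join(rows))
```

The point of writing the class index on every row is that a sparse row which leaves it out is read as carrying the first nominal value, not as unlabeled.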

Currently the only way to query the model is to send an unlabeled instance
on the same channel as the training ones.
Do you have a use case for a different architecture?

As I said, right now SAMOA doesn't save or output the predicted label, but
it's easy to change.
So yes, the label is used only for the statistics right now.

Hope it helps,


--
Gianmarco

On 18 May 2015 at 19:51, Bertsimas Ilias <aw...@gmail.com> wrote:

> Hi Gianmarco,
>
> Yes, that dataset doesn't include a label/class, as I wasn't aware of the
> format I had to follow to include it. Looking at WEKA MOA, I saw the option
> -c -1, which sets the last item of each instance as the class/label.
> I am trying to do sentiment prediction on tweets.
>
> My next question is how to mix labelled and unlabelled instances in my
> dataset, as to my understanding there is no saved model, unlike in SVM batch
> classification.
> Is there another way to query the model by providing instances to test?
>
> Does not storing the label mean that the prediction is also not displayed,
> apart from the statistics output per batch and in the final summary?
>
> Thanks for all the help!
>
> Kind Regards,
>
> Ilias Bertsimas.
>

Re: SAMOA getting started help

Posted by Bertsimas Ilias <aw...@gmail.com>.

Hi Gianmarco,

Yes, that dataset doesn't include a label/class, as I wasn't aware of the
format I had to follow to include it. Looking at WEKA MOA, I saw the option
-c -1, which sets the last item of each instance as the class/label.
I am trying to do sentiment prediction on tweets.

My next question is how to mix labelled and unlabelled instances in my
dataset, as to my understanding there is no saved model, unlike in SVM batch
classification.
Is there another way to query the model by providing instances to test?

Does not storing the label mean that the prediction is also not displayed,
apart from the statistics output per batch and in the final summary?

Thanks for all the help!

Kind Regards,

Ilias Bertsimas.

On 18 May 2015 at 10:24, Gianmarco De Francisci Morales <gd...@apache.org>
wrote:

> Hi Ilias,
>
> Your data does not have a class, but you are trying to train a classifier.
> What are you trying to predict exactly?
> Even in SVM you need a class label for your training set.
> And if you want some accuracy figures, you need a labeled test set as well
> (a ground truth).
>
> PrequentialEvaluation works by using unseen data as test data, and then
> using it for training.
> That is, each instance is first used as a test instance, then as a training
> instance.
> Therefore the n-th instance is tested on the model built on the previous
> n-1 instances.
> If the instance is unlabeled, it simply predicts its label (although it
> doesn't store it anywhere as of now, but that's easy to fix).
>
> Regarding other classifiers without docs, it's one of these 3 things:
> 1) they are not fully tested
> 2) they are not fully parallel (e.g. Naive Bayes right now)
> 3) simply, the doc is missing
>
> Cheers,
>
> --
> Gianmarco
>

Re: SAMOA getting started help

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.

Hi Ilias,

Your data does not have a class, but you are trying to train a classifier.
What are you trying to predict exactly?
Even in SVM you need a class label for your training set.
And if you want some accuracy figures, you need a labeled test set as well
(a ground truth).

PrequentialEvaluation works by using unseen data as test data, and then
using it for training.
That is, each instance is first used as a test instance, then as a training
instance.
Therefore the n-th instance is tested on the model built on the previous
n-1 instances.
If the instance is unlabeled, it simply predicts its label (although it
doesn't store it anywhere as of now, but that's easy to fix).
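
The test-then-train loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SAMOA code: `MajorityClassLearner` and `prequential` are hypothetical names, the learner is a toy that ignores the features, and keeping the per-instance predictions is an addition that SAMOA does not do at the moment.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, features):
        # Before any training instance has arrived there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def train(self, features, label):
        self.counts[label] += 1


def prequential(stream, learner):
    """Test-then-train over (features, label) pairs; label is None if unknown."""
    predictions = []
    tested = correct = 0
    for features, label in stream:
        predicted = learner.predict(features)  # 1) test on the model so far
        predictions.append(predicted)          # keep the per-instance prediction
        if label is not None:
            tested += 1
            correct += int(predicted == label)
            learner.train(features, label)     # 2) then train on this instance
    accuracy = correct / tested if tested else 0.0
    return predictions, accuracy


stream = [
    ({"good": 0.7}, "pos"),
    ({"great": 0.4}, "pos"),
    ({"bad": 0.9}, "neg"),
    ({"nice": 0.5}, None),  # unlabeled: predicted, but not scored or trained on
]
predictions, accuracy = prequential(stream, MajorityClassLearner())
print(predictions, accuracy)
```

Unlabeled instances pass through the same loop: they get a prediction but contribute neither to the accuracy figure nor to the model.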

Regarding other classifiers without docs, it's one of these 3 things:
1) they are not fully tested
2) they are not fully parallel (e.g. Naive Bayes right now)
3) simply, the doc is missing

Cheers,

--
Gianmarco

On 15 May 2015 at 16:38, Bertsimas Ilias <aw...@gmail.com> wrote:

> Hi Gianmarco,
>
> I hope my reply doesn't create a separate thread. Sorry in advance
> if it does; I forgot to subscribe before sending the original message.
>
> Here's an excerpt from my dataset in sparse array ARFF format:
>
> https://drive.google.com/file/d/0B1WaPw_KXbfkaVJ6T0lnMDFBdmc/view?usp=sharing
>
> I am coming from an SVM classification paradigm where you first train your
> model with a labelled data-set and then test it with a separate unlabelled
> data-set.
> How would that translate to the streaming online processing paradigm of
> SAMOA?
>
> I noticed there are a lot of classification tasks available that are not
> listed in the documentation. Is there a reason for that?
>
> Kind Regards,
>
> Ilias Bertsimas.
>
>

Re: SAMOA getting started help

Posted by Bertsimas Ilias <aw...@gmail.com>.

Hi Gianmarco,

I hope my reply doesn't create a separate thread. Sorry in advance
if it does; I forgot to subscribe before sending the original message.

Here's an excerpt from my dataset in sparse array ARFF format:
https://drive.google.com/file/d/0B1WaPw_KXbfkaVJ6T0lnMDFBdmc/view?usp=sharing

I am coming from an SVM classification paradigm where you first train your
model with a labelled data-set and then test it with a separate unlabelled
data-set.
How would that translate to the streaming online processing paradigm of
SAMOA?

I noticed there are a lot of classification tasks available that are not
listed in the documentation. Is there a reason for that?

Kind Regards,

Ilias Bertsimas.


On 13 May 2015 at 14:03, Bertsimas Ilias <aw...@gmail.com> wrote:

> Hi all!
>
> I am in the process of running some tests for online machine learning on
> data streams from social media. I came across Apache SAMOA and it seemed like
> a very interesting framework.
> However, I could not figure out how to get it to train and test
> using a sparse array of tf-idf feature vectors. I provide the data in the
> standard WEKA ARFF format, and although it runs, the output is something
> along the lines of:
>
> 2015-05-12 22:58:58,993 [main] INFO
>>  com.yahoo.labs.samoa.evaluation.EvaluatorProcessor
>> (EvaluatorProcessor.java:189) -
>> com.yahoo.labs.samoa.evaluation.EvaluatorProcessorid = 0
>> evaluation instances,classified instances,classifications correct
>> (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent)
>> 100.0,100.0,100.0,100.0,?
>> 200.0,200.0,100.0,100.0,?
>> 300.0,300.0,100.0,100.0,?
>> 400.0,400.0,100.0,100.0,?
>> 500.0,500.0,100.0,100.0,?
>> 600.0,600.0,100.0,100.0,?
>> 700.0,700.0,100.0,100.0,?
>> 800.0,800.0,100.0,100.0,?
>> 900.0,900.0,100.0,100.0,?
>> 1000.0,1000.0,100.0,100.0,?
>> 1100.0,1100.0,100.0,100.0,?
>> 1200.0,1200.0,100.0,100.0,?
>> 1300.0,1300.0,100.0,100.0,?
>> 1400.0,1400.0,100.0,100.0,?
>> 1500.0,1500.0,100.0,100.0,?
>> 1600.0,1600.0,100.0,100.0,?
>> 1700.0,1700.0,100.0,100.0,?
>> 1800.0,1800.0,100.0,100.0,?
>> 1900.0,1900.0,100.0,100.0,?
>
>
>
> I have read the documentation on the SAMOA project page, but I wasn't able
> to figure out how to get classification results per instance.
> Could you please point me in the right direction in terms of the
> formats SAMOA accepts as stream input? Does the data need to include a
> labeled training set?
>
> Any examples you could provide me with that are not already in the
> documentation would be most welcome!
>
>
> Kind Regards,
>
> Ilias Bertsimas.
>

Re: SAMOA getting started help

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.

Hi Ilias,

Thanks for your interest in SAMOA.

In theory SAMOA should be able to parse the same formats as MOA.
Could you provide a link to your input file?

Not sure I get your second question.
To use the VHT classifier you need to provide labels, as it is a supervised
learning method.

Cheers,

--
Gianmarco

On 13 May 2015 at 16:03, Bertsimas Ilias <aw...@gmail.com> wrote:

> Hi all!
>
> I am in the process of running some tests for online machine learning on
> data streams from social media. I came across Apache SAMOA and it seemed like
> a very interesting framework.
> However, I could not figure out how to get it to train and test
> using a sparse array of tf-idf feature vectors. I provide the data in the
> standard WEKA ARFF format, and although it runs, the output is something
> along the lines of:
>
> 2015-05-12 22:58:58,993 [main] INFO
> >  com.yahoo.labs.samoa.evaluation.EvaluatorProcessor
> > (EvaluatorProcessor.java:189) -
> > com.yahoo.labs.samoa.evaluation.EvaluatorProcessorid = 0
> > evaluation instances,classified instances,classifications correct
> > (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent)
> > 100.0,100.0,100.0,100.0,?
> > 200.0,200.0,100.0,100.0,?
> > 300.0,300.0,100.0,100.0,?
> > 400.0,400.0,100.0,100.0,?
> > 500.0,500.0,100.0,100.0,?
> > 600.0,600.0,100.0,100.0,?
> > 700.0,700.0,100.0,100.0,?
> > 800.0,800.0,100.0,100.0,?
> > 900.0,900.0,100.0,100.0,?
> > 1000.0,1000.0,100.0,100.0,?
> > 1100.0,1100.0,100.0,100.0,?
> > 1200.0,1200.0,100.0,100.0,?
> > 1300.0,1300.0,100.0,100.0,?
> > 1400.0,1400.0,100.0,100.0,?
> > 1500.0,1500.0,100.0,100.0,?
> > 1600.0,1600.0,100.0,100.0,?
> > 1700.0,1700.0,100.0,100.0,?
> > 1800.0,1800.0,100.0,100.0,?
> > 1900.0,1900.0,100.0,100.0,?
>
>
>
> I have read the documentation on the SAMOA project page, but I wasn't able
> to figure out how to get classification results per instance.
> Could you please point me in the right direction in terms of the
> formats SAMOA accepts as stream input? Does the data need to include a
> labeled training set?
>
> Any examples you could provide me with that are not already in the
> documentation would be most welcome!
>
>
> Kind Regards,
>
> Ilias Bertsimas.
>