Posted to dev@samoa.apache.org by Albert Bifet <ab...@apache.org> on 2015/01/10 04:26:46 UTC

New Instances

Hi all,

This is a short explanation of the new instances of SAMOA.

https://github.com/abifet/moa/tree/master/moa/src/main/java/com/yahoo/labs/samoa/instances

Instances will be much simpler than the current implementation. They
can be dense or sparse, and they contain only one array (or two for
sparse) with all the attribute values. In the current implementation
we have two arrays, one for input values and another for output values.

There are two main changes:

1/ All instances are going to be multi-label; that is, they have
input and output attributes, and we can read their values with
getInputValue(i) and getOutputValue(i).

2/ Attributes are numeric by default, so we only keep information about
discrete attributes (their values). For example, if we have one million
numeric attributes, we will not need to store any attribute information
for them.
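
A minimal sketch of the "numeric by default" idea in point 2/, assuming
metadata is kept in a map keyed by attribute index. The class and method
names (SparseAttributeInfo, declareDiscrete, isNumeric, numValues) are
hypothetical and are not taken from the linked code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: metadata is stored only for discrete attributes.
// Any index that was never registered is treated as numeric.
class SparseAttributeInfo {
    // attribute index -> list of discrete values (labels)
    private final Map<Integer, List<String>> discreteValues = new HashMap<>();

    // Registers an attribute as discrete with the given values.
    void declareDiscrete(int attributeIndex, List<String> values) {
        discreteValues.put(attributeIndex, new ArrayList<>(values));
    }

    // Numeric is the default: true unless the index was declared discrete.
    boolean isNumeric(int attributeIndex) {
        return !discreteValues.containsKey(attributeIndex);
    }

    // Number of values of a discrete attribute; 0 for numeric attributes.
    int numValues(int attributeIndex) {
        List<String> values = discreteValues.get(attributeIndex);
        return values == null ? 0 : values.size();
    }

    // How many attributes actually carry stored metadata.
    int storedEntries() {
        return discreteValues.size();
    }
}
```

With one million numeric attributes and a single discrete class attribute,
such a structure would hold one entry rather than one million.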

Basically, we have:

- Instance: interface
- MultiLabelInstance: interface (empty interface that extends Instance)
- InstanceImpl implements MultiLabelInstance: implementation of Instance.
  Contains
    - InstanceData
    - InstancesHeader
- DenseInstance extends InstanceImpl
- SparseInstance extends InstanceImpl

- Instances: a list of instances and an InstanceInformation object
- InstancesHeader extends Instances

- InstanceData: interface
- DenseInstanceData implements InstanceData
- SparseInstanceData implements InstanceData

- InstanceInformation contains the name, the attribute information and
  the attributes to predict.
- AttributesInformation contains two lists of Attributes (indices and
  values) for non-numeric attributes. Attributes are numeric by
  default.
- Range: attributes to predict
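
A rough sketch of how a single value array can serve both
getInputValue(i) and getOutputValue(i). Only those two method names come
from this message; ValueStore, DenseValues, SketchInstance and the
"first numInputs entries are inputs" convention are assumptions made for
illustration, standing in for InstanceData, DenseInstanceData and Range.

```java
// Hypothetical stand-in for InstanceData: one flat store of attribute values.
interface ValueStore {
    double value(int index);
    int size();
}

// Hypothetical stand-in for DenseInstanceData: a single array of values.
class DenseValues implements ValueStore {
    private final double[] values;

    DenseValues(double[] values) {
        this.values = values.clone();
    }

    @Override
    public double value(int index) {
        return values[index];
    }

    @Override
    public int size() {
        return values.length;
    }
}

// Hypothetical instance: indices [0, numInputs) are inputs, the rest outputs.
// The split point plays the role that Range plays in the description above.
class SketchInstance {
    private final ValueStore data;
    private final int numInputs;

    SketchInstance(ValueStore data, int numInputs) {
        if (numInputs < 0 || numInputs > data.size()) {
            throw new IllegalArgumentException("numInputs out of range");
        }
        this.data = data;
        this.numInputs = numInputs;
    }

    double getInputValue(int i) {
        if (i < 0 || i >= numInputs) {
            throw new IndexOutOfBoundsException("input " + i);
        }
        return data.value(i);
    }

    double getOutputValue(int i) {
        if (i < 0 || i >= numOutputAttributes()) {
            throw new IndexOutOfBoundsException("output " + i);
        }
        return data.value(numInputs + i);
    }

    int numOutputAttributes() {
        return data.size() - numInputs;
    }
}
```

A sparse store would implement the same ValueStore interface, so the
instance class itself would not need to change.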

Cheers,

Albert

Re: New Instances

Posted by Albert Bifet <ab...@waikato.ac.nz>.
No, there is still no pull request for that.

Yes, it should be possible to add new classes and new attributes dynamically.

Thanks,

Albert

On Tue, Jan 20, 2015 at 1:17 AM, Matthieu Morel <mm...@apache.org> wrote:
> Is there a pull request somewhere?
>
> One thing I want for example, is to dynamically set the number of
> classes for an instance, as we discover those classes in the stream.
> Hopefully the new instances will allow that.
>
> Thanks,
>
> Matthieu
>
> On Wed, Jan 14, 2015 at 2:34 AM, Albert Bifet <ab...@waikato.ac.nz> wrote:
>> Thanks Gianmarco,
>>
>> 1/ Range contains the information about which attributes are inputs and
>> which are outputs. Each instance has an InstancesHeader field that contains an
>> AttributesInformation object.
>>
>> 2/ In the case that there is no metadata information, then all attributes
>> are numeric, right? This seems reasonable.
>>
>> - InstancesHeader contains an InstanceInformation object. We may use
>> InstanceInformation instead of InstancesHeader.
>>
>> - Yes, AttributesInformation can be modified at runtime, adding attributes
>> and values of attributes.
>>
>> Cheers,
>>
>> Albert
>>
>> On Tue, Jan 13, 2015 at 9:18 PM, Gianmarco De Francisci Morales <
>> gdfm@apache.org> wrote:
>>
>>> Thanks Albert.
>>>
>>> I have a couple of questions.
>>>
>>> 1/ how do we distinguish between input and output attributes?
>>> In particular, let's take as an example the default single-label
>>> classification.
>>> I guess that is the role of Range.
>>> However, do we have to serialize it with every instance we send?
>>>
>>> 2/ to distinguish between numeric and categorical we need some metadata,
>>> which I guess goes into InstancesHeader.
>>> I am fine with keeping it also for compatibility with MOA, and we might use
>>> it if we have access to it.
>>> However, I would prefer algorithms not to rely on it, and consider the
>>> presence of metadata optional.
>>>
>>> Some other points:
>>> - what's the difference between InstanceInformation and InstancesHeader?
>>> - can the AttributesInformation be modified at runtime? Or is it statically
>>> set for the whole duration of the algorithm?
>>>
>>> Cheers,
>>>
>>> --
>>> Gianmarco

Re: New Instances

Posted by Matthieu Morel <mm...@apache.org>.
Hi, I won't be doing updates at the moment.

Note that this won't merge directly. One reason, if I remember
correctly, is that, in contrast to the existing implementation, the
instances in this proposal do not keep the whole history. It's really
intended for streaming.

Matthieu

On Mon, Mar 30, 2015 at 10:26 AM, Gianmarco De Francisci Morales
<gd...@apache.org> wrote:
> Hi,
>
> I think we should restart this conversation.
> Matthieu, do you think we can review the branch?
> Or do you want to do any update on it before?
>
> Cheers,
>
> --
> Gianmarco

Re: New Instances

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Hi,

I think we should restart this conversation.
Matthieu, do you think we can review the branch?
Or do you want to do any update on it before?

Cheers,

--
Gianmarco

On 26 January 2015 at 16:20, Albert Bifet <ab...@waikato.ac.nz> wrote:

> Hi Matthieu,
>
> Thanks for your answers! I agree with using double values to store
> attribute information. I think we need to define how to maintain the
> mapping, as some learners need to know whether attributes are discrete or
> numeric, and how many values the discrete attributes have, in order to
> learn and make predictions.
>
> Cheers, Albert

Re: New Instances

Posted by Albert Bifet <ab...@waikato.ac.nz>.
Hi Matthieu,

Thanks for your answers! I agree with using double values to store
attribute information. I think we need to define how to maintain the
mapping, as some learners need to know whether attributes are discrete or
numeric, and how many values the discrete attributes have, in order to
learn and make predictions.
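
One possible shape for such a mapping, as a sketch only: a dictionary per
discrete attribute that assigns double codes to labels in order of first
appearance and can report how many values it has seen. LabelDictionary
and its methods are hypothetical names, not part of either proposal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a label <-> double mapping kept outside the
// instances, e.g. built during feature extraction and shared with the
// processing topology.
class LabelDictionary {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> labels = new ArrayList<>();

    // Returns the double code for a label, assigning the next free code
    // the first time the label is seen.
    double encode(String label) {
        Integer known = indexOf.get(label);
        if (known != null) {
            return known.doubleValue();
        }
        int index = labels.size();
        indexOf.put(label, index);
        labels.add(label);
        return index;
    }

    // Maps a code produced by encode() back to its label.
    String decode(double code) {
        return labels.get((int) code);
    }

    // Number of distinct values seen so far for this attribute.
    int numValues() {
        return labels.size();
    }
}
```

A learner that needs to know whether attribute i is discrete could then
check whether a dictionary exists for i, and ask it for numValues().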

Cheers, Albert

On Mon, Jan 26, 2015 at 7:33 PM, Matthieu Morel <mm...@apache.org> wrote:

> - discrete attributes are eventually mapped to double values, and
> that's the appropriate input to instances, in my understanding. My
> idea was to maintain the mapping in the feature extraction step, and
> share it in some way with the processing topology.
>
> - regarding performance in sparse instances, I haven't done any sort
> of benchmark yet. The implementation can be changed while keeping the
> same API.
> From what I see, on the one hand, in the current approach using an
> index array, we have the extra constraints that 1/ this index array
> must be sorted (adds building time), and 2/ we have to do a binary
> search for the index value (log(n)).
> On the other hand, there are some very efficient map implementations
> that we could reuse. For example, CERN's colt package, actually
> already imported in the mahout-collections ASF package.
>
> I hope this answers your questions,
>
> Matthieu

Re: New Instances

Posted by Matthieu Morel <mm...@apache.org>.
- discrete attributes are eventually mapped to double values, and
that's the appropriate input to instances, in my understanding. My
idea was to maintain the mapping in the feature extraction step, and
share it in some way with the processing topology.

- regarding performance in sparse instances, I haven't done any sort
of benchmark yet. The implementation can be changed while keeping the
same API.
From what I see, on the one hand, the current approach using an
index array has the extra constraints that 1/ this index array
must be sorted (adds building time), and 2/ we have to do a binary
search for the index value (O(log n)).
On the other hand, there are some very efficient map implementations
that we could reuse, for example CERN's colt package, which is
already imported in the mahout-collections ASF package.
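
To make the two options concrete, here is a small sketch under stated
assumptions: TwoArraySparseVector and MapSparseVector are hypothetical
names, and java.util.HashMap is used only as a stand-in for a
primitive-keyed map such as the ones in colt. No benchmark is implied.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Option 1: two parallel arrays. The index array must be sorted so that
// lookups can use binary search, O(log n) per access.
class TwoArraySparseVector {
    private final int[] indices;
    private final double[] values;

    // Expects indices in strictly increasing order, aligned with values.
    TwoArraySparseVector(int[] sortedIndices, double[] values) {
        this.indices = sortedIndices.clone();
        this.values = values.clone();
    }

    double get(int attributeIndex) {
        int pos = Arrays.binarySearch(indices, attributeIndex);
        return pos >= 0 ? values[pos] : 0.0; // absent entries read as 0
    }
}

// Option 2: a hash map. No ordering requirement and expected O(1)
// lookups, at the cost of boxing when using java.util.HashMap.
class MapSparseVector {
    private final Map<Integer, Double> entries = new HashMap<>();

    void set(int attributeIndex, double value) {
        entries.put(attributeIndex, value);
    }

    double get(int attributeIndex) {
        return entries.getOrDefault(attributeIndex, 0.0);
    }
}
```

Both classes expose the same get(int), which is the point made above: the
storage can be swapped while keeping the same API.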

I hope this answers your questions,

Matthieu


On Mon, Jan 26, 2015 at 7:30 AM, Albert Bifet <ab...@waikato.ac.nz> wrote:
> Nice and simple API! Some things to comment:
>
> - how can we manage discrete attributes, for example attribute class:
> "+","-"?
>
> - In sparse instances, is the performance of a map similar to the
> performance of two arrays, one for indices and one for values?
>
> Albert

Re: New Instances

Posted by Albert Bifet <ab...@waikato.ac.nz>.
Nice and simple API! A few comments:

- How can we manage discrete attributes, for example a class attribute
with values "+" and "-"?

- In sparse instances, is the performance of a map similar to the
performance of two arrays, one for indices and one for values?

Albert

On Sat, Jan 24, 2015 at 1:38 AM, Matthieu Morel <ma...@gmail.com>
wrote:

> I took a shot at drafting a simplified API for instances.
> https://github.com/matthieumorel/samoa/tree/new-instances
>
> As pointed out in this thread, the current API is too exhaustive, too
> tied to a specific implementation, and too tied to WEKA/MOA APIs.
>
> In addition, I feel the header/information does not belong to the
> instance. This is something which is used when parsing arff files
> where static information about the stream is available from the start.
> But for a real streaming use case, we should not make such an assumption.
> Whatever is known at the beginning should be loaded by the topology,
> but not included in the instances. Arff files can still be loaded and
> generate instances in the new format. Only the headers should be
> parsed separately.
>
> This proposal is a draft and single label only. It should be easy to
> add functionality suggested by Albert for multi labels.
>
> Feel free to comment!
>
> Matthieu

Re: New Instances

Posted by Matthieu Morel <ma...@gmail.com>.
I took a shot at drafting a simplified API for instances.
https://github.com/matthieumorel/samoa/tree/new-instances

As pointed out in this thread, the current API is too exhaustive, too
tied to a specific implementation, and too tied to WEKA/MOA APIs.

In addition, I feel the header/information does not belong to the
instance. This is something which is used when parsing ARFF files,
where static information about the stream is available from the start.
But for a real streaming use case, we should not make such an assumption.
Whatever is known at the beginning should be loaded by the topology,
but not included in the instances. ARFF files can still be loaded and
generate instances in the new format. Only the headers should be
parsed separately.
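
As an illustration of parsing the header separately, here is a
deliberately simplified sketch. ArffHeaderSketch and ArffRowSketch are
hypothetical names; the parser assumes only "@attribute <name> numeric"
and "@attribute <name> {a,b,...}" declarations and ignores quoting,
comments and the other ARFF types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical header parser: reads declarations up to "@data".
class ArffHeaderSketch {
    // attribute name -> nominal values (an empty list means numeric)
    final Map<String, List<String>> attributes = new LinkedHashMap<>();
    // index of the first line after "@data", or -1 if there is none
    int firstDataLine = -1;

    static ArffHeaderSketch parse(List<String> lines) {
        ArffHeaderSketch header = new ArffHeaderSketch();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("@attribute")) {
                String[] parts = line.split("\\s+", 3);
                String type = parts[2].trim();
                List<String> values = new ArrayList<>();
                if (type.startsWith("{")) {
                    String inner = type.substring(1, type.length() - 1);
                    for (String v : inner.split(",")) {
                        values.add(v.trim());
                    }
                }
                header.attributes.put(parts[1], values);
            } else if (lower.startsWith("@data")) {
                header.firstDataLine = i + 1;
                break;
            }
        }
        return header;
    }
}

// Data rows become plain double arrays. The header is consulted while
// building a row but is not stored inside it.
class ArffRowSketch {
    static double[] toValues(String row, ArffHeaderSketch header) {
        String[] cells = row.split(",");
        double[] values = new double[cells.length];
        int i = 0;
        for (List<String> nominal : header.attributes.values()) {
            String cell = cells[i].trim();
            // Nominal cells are encoded as the position of their label.
            values[i] = nominal.isEmpty() ? Double.parseDouble(cell) : nominal.indexOf(cell);
            i++;
        }
        return values;
    }
}
```

In this arrangement the topology would load the header once, and the
stream would carry only the value arrays.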

This proposal is a draft and single-label only. It should be easy to
add the functionality suggested by Albert for multi-label.

Feel free to comment!

Matthieu




On Wed, Jan 21, 2015 at 2:31 AM, Albert Bifet <ab...@waikato.ac.nz> wrote:
> 1/ Learners such as decision trees can deal with new instances that arrive
> with more class labels. New instances can arrive with new headers.
>
> 2/ To change class labels dynamically, we need to add a method
> "setValue(int, String)" to the Attribute class so that new attribute
> values can be added dynamically.
>
> 3/ The current design stays compatible with the methods in WEKA
> instances. It would be nice to have a fresher design. I will need some
> help to come up with a simplified and fresher design of the instances, as I'm a
> bit conditioned by the previous instance usage :)
>
> Thanks,
>
> Albert

Re: New Instances

Posted by Albert Bifet <ab...@waikato.ac.nz>.
1/ Learners such as decision trees can deal with new instances that arrive
with more class labels. New instances can arrive with new headers.

2/ To change class labels dynamically, we need to add a method
"setValue(int, string)" to the Attribute class so that new attribute
values can be added at runtime.

3/ The current design stays compatible with the methods in Weka
instances. It would be nice to have a fresher design. I will need some
help to come up with a simpler and fresher design of the instances, as
I'm a bit conditioned by the previous instance usage :)
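
A minimal sketch of what point 2/ could look like, assuming a simplified attribute class. NominalAttribute and indexOfOrAdd are hypothetical names chosen for illustration; only the setValue(int, String) shape comes from the proposal above, and the real Attribute class may end up different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Attribute class; it only covers the part
// needed to grow the set of nominal values at runtime.
class NominalAttribute {
    private final String name;
    private final List<String> values = new ArrayList<>();

    NominalAttribute(String name) {
        this.name = name;
    }

    // Stores a label at the given index, growing the list if needed.
    // Mirrors the "setValue(int, string)" signature proposed in the thread.
    void setValue(int index, String label) {
        while (values.size() <= index) {
            values.add(null); // placeholder for indices that have no label yet
        }
        values.set(index, label);
    }

    // Returns the index of a label, registering the label first if it is new.
    // This helper is an assumption, not part of the proposal.
    int indexOfOrAdd(String label) {
        int index = values.indexOf(label);
        if (index < 0) {
            index = values.size();
            setValue(index, label);
        }
        return index;
    }

    int numValues() {
        return values.size();
    }

    String value(int index) {
        return values.get(index);
    }

    String name() {
        return name;
    }
}
```

With something like this, a learner that meets an unseen class label in the stream could call indexOfOrAdd and resize its own statistics to numValues().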

Thanks,

Albert



On Wed, Jan 21, 2015 at 2:33 AM, Olivier Van Laere
<ol...@gmail.com> wrote:
> Hey Matthieu,
>
>> On Jan 20, 2015, at 1:47 AM, Matthieu Morel <ma...@gmail.com> wrote:
>>
>> I'm confused. From what I see the number of classes is currently fixed
>> in the instance header. As if it was static. I suppose you can work
>> around that limitation with some hacks but I want to use a clean API
>> for that.
>>
>> Or is there a recommended way I'm missing?
>
> Ah, I think I remember now what happened. As far as I encountered this situation, the data had say an .arff format with a header stating the number of class values, and the instance header was read from that, while the instances were then read by the line.
>
> I worked around that by just storing the class label seen in the instances on the fly when building a model, and ignored that field of the instance header. Sorry for the confusion!
>
> Cheers,
> Olivier
>
>

Re: New Instances

Posted by Olivier Van Laere <ol...@gmail.com>.
Hey Matthieu,

> On Jan 20, 2015, at 1:47 AM, Matthieu Morel <ma...@gmail.com> wrote:
> 
> I'm confused. From what I see the number of classes is currently fixed
> in the instance header. As if it was static. I suppose you can work
> around that limitation with some hacks but I want to use a clean API
> for that.
> 
> Or is there a recommended way I'm missing?

Ah, I think I remember now what happened. When I ran into this situation, the data was in, say, .arff format with a header stating the number of class values; the instance header was read from that, while the instances were then read line by line.

I worked around that by storing the class labels seen in the instances on the fly while building the model, and ignoring that field of the instance header. Sorry for the confusion!
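
The workaround described above could be sketched roughly as follows. SeenLabels is a hypothetical helper, not a SAMOA class; it assumes labels arrive as strings and that the model only needs a per-label count.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: the model, not the instance header, tracks which
// class labels have been seen so far in the stream.
class SeenLabels {
    private final Map<String, Integer> indexByLabel = new HashMap<>();
    private long[] counts = new long[4];

    // Records one occurrence of a label and returns its index,
    // assigning a fresh index the first time the label is seen.
    int observe(String label) {
        Integer index = indexByLabel.get(label);
        if (index == null) {
            index = indexByLabel.size();
            indexByLabel.put(label, index);
            if (index >= counts.length) {
                counts = Arrays.copyOf(counts, counts.length * 2);
            }
        }
        counts[index]++;
        return index;
    }

    int numLabels() {
        return indexByLabel.size();
    }

    // Returns how often a label was observed; 0 for labels never seen.
    long count(String label) {
        Integer index = indexByLabel.get(label);
        return index == null ? 0L : counts[index];
    }
}
```

The number of classes declared in the header is never consulted here, which is the point of the workaround: the model's view of the label set grows with the data.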

Cheers,
Olivier



Re: New Instances

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Hi,

I think the main interface (Instance) is very complex.
It has a very large number of methods, and the reason behind some of them,
and how they relate to each other, is obscure to me.

For example, what is classAttribute supposed to do when we have a
multi-target instance?
What about sparse-only methods when called on dense instances?
What are the semantics of insertAttributeAt, and how is it different from
setValue?

I have other doubts, but maybe it's just lack of documentation/comments in
the code.

In general, what about having a hierarchy of interfaces for Instances,
rather than putting all the functionality in a single one?
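
To make the suggestion concrete, here is one possible split, purely as a sketch. All interface and class names below are hypothetical and do not correspond to the code in the repository; the idea is only that sparse-only and multi-target operations live in their own interfaces, so the question of what they do on the wrong kind of instance does not arise.

```java
import java.util.Arrays;

// Core read access shared by every instance (hypothetical name).
interface BasicInstance {
    int numAttributes();
    double value(int attributeIndex);
}

// Sparse-only operations, kept out of the core interface.
interface SparseAccess {
    int numStoredValues();
    int indexOfStored(int position);
    double valueOfStored(int position);
}

// Multi-target operations; a single-label instance need not implement this.
interface MultiTarget {
    int numOutputs();
    double outputValue(int outputIndex);
}

final class SimpleDenseInstance implements BasicInstance {
    private final double[] values;

    SimpleDenseInstance(double[] values) {
        this.values = values.clone();
    }

    @Override public int numAttributes() { return values.length; }
    @Override public double value(int attributeIndex) { return values[attributeIndex]; }
}

final class SimpleSparseInstance implements BasicInstance, SparseAccess {
    private final int numAttributes;
    private final int[] indices;   // attribute indices, must be sorted ascending
    private final double[] stored; // value stored for each entry of indices

    SimpleSparseInstance(int numAttributes, int[] indices, double[] stored) {
        if (indices.length != stored.length) {
            throw new IllegalArgumentException("indices and values differ in length");
        }
        this.numAttributes = numAttributes;
        this.indices = indices.clone();
        this.stored = stored.clone();
    }

    @Override public int numAttributes() { return numAttributes; }

    // Attributes that are not stored are treated as 0.0 in this sketch.
    @Override public double value(int attributeIndex) {
        int position = Arrays.binarySearch(indices, attributeIndex);
        return position >= 0 ? stored[position] : 0.0;
    }

    @Override public int numStoredValues() { return indices.length; }
    @Override public int indexOfStored(int position) { return indices[position]; }
    @Override public double valueOfStored(int position) { return stored[position]; }
}
```

Code that needs sparse iteration would ask for a SparseAccess explicitly, while everything else would depend on BasicInstance alone.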

Cheers,


--
Gianmarco

On 20 January 2015 at 10:47, Matthieu Morel <ma...@gmail.com>
wrote:

> I'm confused. From what I see the number of classes is currently fixed
> in the instance header. As if it was static. I suppose you can work
> around that limitation with some hacks but I want to use a clean API
> for that.
>
> Or is there a recommended way I'm missing?
>
> Thanks,
>
> Matthieu

Re: New Instances

Posted by Matthieu Morel <ma...@gmail.com>.
I'm confused. From what I see the number of classes is currently fixed
in the instance header. As if it was static. I suppose you can work
around that limitation with some hacks but I want to use a clean API
for that.

Or is there a recommended way I'm missing?

Thanks,

Matthieu

On Tue, Jan 20, 2015 at 7:17 AM, Albert Bifet <ab...@waikato.ac.nz> wrote:
> Olivier, you're right, that was also possible in the old implementation.
>
> Cheers,
> Albert

Re: New Instances

Posted by Albert Bifet <ab...@waikato.ac.nz>.
Olivier, you're right, that was also possible in the old implementation.

Cheers,
Albert

On Tue, Jan 20, 2015 at 1:45 PM, Olivier Van Laere
<ol...@gmail.com> wrote:
> Hey,
>
>> One thing I want for example, is to dynamically set the number of
>> classes for an instance, as we discover those classes in the stream.
>> Hopefully the new instances will allow that.
>
> Out of interest: how was this not possible in the old version? Did the instances themselves store info about the number of classes? I’m asking as I remember having dynamic expansion of the potential class labels in the Naive Bayes model, but there it is the model that keeps track of which labels have been seen so far, by reading that data out of the instances.
>
> Thanks for the update Albert, seems an interesting improvement over the current implementation.
>
> Cheers,
> Olivier

Re: New Instances

Posted by Olivier Van Laere <ol...@gmail.com>.
Hey,

> One thing I want for example, is to dynamically set the number of
> classes for an instance, as we discover those classes in the stream.
> Hopefully the new instances will allow that.

Out of interest: how was this not possible in the old version? Did the instances themselves store info about the number of classes? I’m asking as I remember having dynamic expansion of the potential class labels in the Naive Bayes model, but there it is the model that keeps track of which labels have been seen so far, by reading that data out of the instances.

Thanks for the update Albert, seems an interesting improvement over the current implementation.

Cheers,
Olivier

Re: New Instances

Posted by Matthieu Morel <mm...@apache.org>.
Is there a pull request somewhere?

One thing I want for example, is to dynamically set the number of
classes for an instance, as we discover those classes in the stream.
Hopefully the new instances will allow that.

Thanks,

Matthieu

On Wed, Jan 14, 2015 at 2:34 AM, Albert Bifet <ab...@waikato.ac.nz> wrote:
> Thanks Gianmarco,
>
> 1/ Range contains the information of which are the input and output
> attributes.  Each instance has an InstancesHeader field that contains an
> AttributesInformation object.
>
> 2/ In the case that there is no metadata information, then all attributes
> are numeric, right? This seems reasonable.
>
> - InstancesHeader contains an InstanceInformation object. We may use
> InstanceInformation instead of InstancesHeader.
>
> - Yes, AttributesInformation can be modified at runtime, adding attributes
> and values of attributes.
>
> Cheers,
>
> Albert

Re: New Instances

Posted by Albert Bifet <ab...@waikato.ac.nz>.
Thanks Gianmarco,

1/ Range records which attributes are inputs and which are outputs.
Each instance has an InstancesHeader field that contains an
AttributesInformation object.

2/ In the case that there is no metadata information, then all attributes
are numeric, right? This seems reasonable.

- InstancesHeader contains an InstanceInformation object. We may use
InstanceInformation instead of InstancesHeader.

- Yes, AttributesInformation can be modified at runtime, adding attributes
and values of attributes.
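
As a rough illustration of the "numeric by default" idea with metadata that can change at runtime: the class below is a hypothetical sketch, not the AttributesInformation class from the repository. It assumes nominal attributes are identified by their index and that any index without an entry is numeric.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: metadata is kept only for nominal attributes.
class SparseAttributesInfo {
    // attribute index -> known nominal values; absent index means numeric
    private final Map<Integer, List<String>> nominalValues = new HashMap<>();

    boolean isNumeric(int attributeIndex) {
        return !nominalValues.containsKey(attributeIndex);
    }

    // Declares the attribute as nominal if needed and registers the value.
    // Returns the position of the value within that attribute.
    int addNominalValue(int attributeIndex, String value) {
        List<String> values =
                nominalValues.computeIfAbsent(attributeIndex, k -> new ArrayList<>());
        int position = values.indexOf(value);
        if (position < 0) {
            values.add(value);
            position = values.size() - 1;
        }
        return position;
    }

    int numNominalAttributes() {
        return nominalValues.size();
    }

    // Known values of an attribute; empty for numeric attributes.
    List<String> valuesOf(int attributeIndex) {
        List<String> values = nominalValues.get(attributeIndex);
        return values == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(values);
    }
}
```

With no metadata at all the map is empty and every attribute reports as numeric, so a stream with a million numeric attributes costs nothing in attribute information.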

Cheers,

Albert

On Tue, Jan 13, 2015 at 9:18 PM, Gianmarco De Francisci Morales <
gdfm@apache.org> wrote:

> Thanks Albert.
>
> I have a couple of questions.
>
> 1/ how do we distinguish between input and output attributes?
> In particular, let's take as an example the default single-label
> classification.
> I guess that is the role of Range.
> However, do we have to serialize it with every instance we send?
>
> 2/ to distinguish between numeric and categorical we need some metadata,
> which I guess goes into InstancesHeader.
> I am fine with keeping it also for compatibility with MOA, and we might use
> it if we have access to it.
> However, I would prefer algorithms not to rely on it, and consider the
> presence of metadata optional.
>
> Some other points:
> - what's the difference between InstanceInformation and InstancesHeaders
> - can the AttributesInformation be modified at runtime? Or is it statically
> set for the whole duration of the algorithm?
>
> Cheers,
>
> --
> Gianmarco

Re: New Instances

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Thanks Albert.

I have a couple of questions.

1/ how do we distinguish between input and output attributes?
In particular, let's take as an example the default single-label
classification.
I guess that is the role of Range.
However, do we have to serialize it with every instance we send?

2/ to distinguish between numeric and categorical we need some metadata,
which I guess goes into InstancesHeader.
I am fine with keeping it also for compatibility with MOA, and we might use
it if we have access to it.
However, I would prefer algorithms not to rely on it, and consider the
presence of metadata optional.

Some other points:
- what's the difference between InstanceInformation and InstancesHeader?
- can the AttributesInformation be modified at runtime? Or is it statically
set for the whole duration of the algorithm?
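
To make question 1/ concrete, here is a sketch of how a Range-like object could separate inputs from outputs over a single value array. Every name is hypothetical, and the sketch assumes the outputs form one contiguous block of attribute indices. The instances hold a reference to a shared header rather than their own copy of the range; whether that header has to travel with each serialized instance is exactly the open question.

```java
// Hypothetical sketch of a contiguous output range [start, end).
final class OutputRange {
    private final int start; // first output index, inclusive
    private final int end;   // last output index, exclusive

    OutputRange(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range");
        }
        this.start = start;
        this.end = end;
    }

    int start() { return start; }
    int end() { return end; }
    int numOutputs() { return end - start; }
}

// Metadata shared by all instances of a stream (hypothetical).
final class SharedHeader {
    final int numAttributes;
    final OutputRange outputs;

    SharedHeader(int numAttributes, OutputRange outputs) {
        if (outputs.end() > numAttributes) {
            throw new IllegalArgumentException("range exceeds attribute count");
        }
        this.numAttributes = numAttributes;
        this.outputs = outputs;
    }
}

final class RangedInstance {
    private final double[] values;     // inputs and outputs in one array
    private final SharedHeader header; // reference only, not a copy

    RangedInstance(double[] values, SharedHeader header) {
        if (values.length != header.numAttributes) {
            throw new IllegalArgumentException("wrong number of values");
        }
        this.values = values.clone();
        this.header = header;
    }

    int numInputs() {
        return values.length - header.outputs.numOutputs();
    }

    double getOutputValue(int i) {
        return values[header.outputs.start() + i];
    }

    // Inputs are all attributes outside the output range, in index order.
    double getInputValue(int i) {
        int index = i < header.outputs.start() ? i : i + header.outputs.numOutputs();
        return values[index];
    }
}
```

In the default single-label case the range would cover one attribute, typically the last one, and getOutputValue(0) would return the class value.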

Cheers,

--
Gianmarco

On 10 January 2015 at 04:26, Albert Bifet <ab...@apache.org> wrote:

> Hi all,
>
> This is a short explanation of the new instances of SAMOA.
>
>
> https://github.com/abifet/moa/tree/master/moa/src/main/java/com/yahoo/labs/samoa/instances
>
> Instances will be much simpler than the current implementation. They
> can be dense or sparse, and they contain only one array (or two for
> sparse) with all the attribute values. In the current implementation
> we have two arrays, one for input values and another for output values
>
> The main changes are two:
>
> 1/ All instances are going to be multi-label, that means they have
> input and output attributes, and we can call their values with
> getInputValue(i) and getOutputValue(i).
>
> 2/ Attributes are numeric by default, so we only keep information of
> discrete attributes (values). For example if we have one million
> numeric attributes, we will not need to store attribute information of
> these one million numeric attributes.
>
> Basically, we have:
>
> - Instance: interface
> - MultiLabelInstance: interface (empty interface that extends Instance)
> - InstanceImpl extends MultiLabelInstance: implementation of Instance.
> Contains
>     - InstanceData
>     - InstancesHeader
> - DenseInstance extends InstanceImpl
> - SparseInstance extends InstanceImpl
>
> -Instances: a list of instances and an InstanceInformation object
> -InstancesHeader extends Instances
>
> -InstanceData: interface
> -DenseInstanceData implements InstanceData
> -SparseInstanceData implements InstanceData
>
> - InstanceInformation contains name, attribute information and
> attributes to predict.
> - AttributesInformation contains two list of Attributes (indices and
> values) for non-numerical attributes. Numerical attributes are by
> default
> - Range: attributes to predict
>
> Cheers,
>
> Albert
>