Posted to dev@samoa.apache.org by Gianmarco De Francisci Morales <gd...@apache.org> on 2015/03/30 10:26:47 UTC

Re: New Instances

Hi,

I think we should restart this conversation.
Matthieu, do you think we can review the branch?
Or do you want to update it first?

Cheers,

--
Gianmarco

On 26 January 2015 at 16:20, Albert Bifet <ab...@waikato.ac.nz> wrote:

> Hi Matthieu,
>
> Thanks for your answers! I agree with using double values to store
> attribute information. I think we need to define how to maintain the
> mapping, as some learners need to know if attributes are discrete or
> numeric, in order to learn and do predictions, and how many values the
> discrete attributes have.
>
> Cheers, Albert
>
> On Mon, Jan 26, 2015 at 7:33 PM, Matthieu Morel <mm...@apache.org> wrote:
>
> > - discrete attributes are eventually mapped to double values, and
> > that's the appropriate input to instances, in my understanding. My
> > idea was to maintain the mapping in the feature extraction step, and
> > share it in some way with the processing topology.
> >
> > - regarding performance in sparse instances, I haven't done any sort
> > of benchmark yet. The implementation can be changed while keeping the
> > same API.
> > From what I see, on the one hand, in the current approach using an
> > index array, we have the extra constraints that 1/ this index array
> > must be sorted (adds building time), and 2/ we have to do a binary
> > search for the index value (log(n)).
> > On the other hand, there are some very efficient map implementations
> > that we could reuse. For example, CERN's colt package, actually
> > already imported in the mahout-collections ASF package.
> >
> > I hope this answers your questions,
> >
> > Matthieu
> >
> >
> > On Mon, Jan 26, 2015 at 7:30 AM, Albert Bifet <ab...@waikato.ac.nz>
> > wrote:
> > > Nice and simple API! Some things to comment:
> > >
> > > - how can we manage discrete attributes, for example attribute class:
> > > "+","-"?
> > >
> > > - In sparse instances, is the performance of a map similar to the
> > > performance of two arrays, one for indices and one for values?
> > >
> > > Albert
> > >
> > > On Sat, Jan 24, 2015 at 1:38 AM, Matthieu Morel <matthieu.morel@gmail.com>
> > > wrote:
> > >
> > >> I took a shot at drafting a simplified API for instances.
> > >> https://github.com/matthieumorel/samoa/tree/new-instances
> > >>
> > >> As pointed out in this thread, the current API is too exhaustive, too
> > >> tied to a specific implementation, and too tied to WEKA/MOA APIs.
> > >>
> > >> In addition, I feel the header/information does not belong to the
> > >> instance. This is something which is used when parsing arff files
> > >> where static information about the stream is available from the start.
> > >> But for a real streaming use case, we should not make such an assumption.
> > >> Whatever is known at the beginning should be loaded by the topology,
> > >> but not included in the instances. Arff files can still be loaded and
> > >> generate instances in the new format. Only the headers should be
> > >> parsed separately.
> > >>
> > >> This proposal is a draft and single label only. It should be easy to
> > >> add functionality suggested by Albert for multi labels.
> > >>
> > >> Feel free to comment!
> > >>
> > >> Matthieu
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Jan 21, 2015 at 2:31 AM, Albert Bifet <ab...@waikato.ac.nz>
> > >> wrote:
> > >> > 1/ Learners such as decision trees can deal with new instances that
> > >> > arrive with more label classes. New instances can arrive with new
> > >> > headers.
> > >> >
> > >> > 2/ To change class labels dynamically, we need to add a method
> > >> > "setValue(int, string)" to the Attribute class so that new attribute
> > >> > values can be added dynamically.
> > >> >
> > >> > 3/ The current design is compatible with the methods in WEKA
> > >> > instances. It would be nice to have a fresher design. I will need
> > >> > some help to arrive at a simplified and fresher design of the
> > >> > instances, as I'm a bit conditioned by the previous instance usage :)
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Albert
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Jan 21, 2015 at 2:33 AM, Olivier Van Laere
> > >> > <ol...@gmail.com> wrote:
> > >> >> Hey Matthieu,
> > >> >>
> > >> >>> On Jan 20, 2015, at 1:47 AM, Matthieu Morel <matthieu.morel@gmail.com> wrote:
> > >> >>>
> > >> >>> I'm confused. From what I see the number of classes is currently
> > >> >>> fixed in the instance header. As if it was static. I suppose you
> > >> >>> can work around that limitation with some hacks but I want to use
> > >> >>> a clean API for that.
> > >> >>>
> > >> >>> Or is there a recommended way I'm missing?
> > >> >>
> > >> >> Ah, I think I remember now what happened. As far as I encountered
> > >> >> this situation, the data had, say, an .arff format with a header
> > >> >> stating the number of class values, and the instance header was read
> > >> >> from that, while the instances were then read line by line.
> > >> >>
> > >> >> I worked around that by just storing the class labels seen in the
> > >> >> instances on the fly when building a model, and ignored that field
> > >> >> of the instance header. Sorry for the confusion!
> > >> >>
> > >> >> Cheers,
> > >> >> Olivier
> > >> >>
> > >> >>
> > >>
> >
>
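
To make the sparse-instance trade-off from the quoted thread concrete, here is a minimal sketch of the two layouts being compared: a sorted index array searched with a binary search, and a map keyed by attribute index. The class and method names (SparseLookupSketch, ArraySparse, MapSparse) are hypothetical and are not taken from SAMOA or from the new-instances branch. java.util.HashMap stands in for the primitive-keyed maps mentioned in the thread (colt / mahout-collections), which would avoid boxing. Treating absent attributes as 0.0 is also an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SparseLookupSketch {

    /** Two parallel arrays; the index array must be sorted before use. */
    static final class ArraySparse {
        private final int[] indices;   // sorted attribute indices
        private final double[] values; // values[i] belongs to indices[i]

        ArraySparse(int[] sortedIndices, double[] values) {
            this.indices = sortedIndices;
            this.values = values;
        }

        double get(int attributeIndex) {
            // Binary search over the sorted indices: O(log n) per lookup.
            int pos = Arrays.binarySearch(indices, attributeIndex);
            return pos >= 0 ? values[pos] : 0.0; // assumption: absent means 0.0
        }
    }

    /** Map-backed variant: no sorting needed, expected O(1) lookup. */
    static final class MapSparse {
        // HashMap boxes keys and values; a primitive map would not.
        private final Map<Integer, Double> values = new HashMap<>();

        void put(int attributeIndex, double value) {
            values.put(attributeIndex, value);
        }

        double get(int attributeIndex) {
            return values.getOrDefault(attributeIndex, 0.0);
        }
    }

    public static void main(String[] args) {
        ArraySparse a = new ArraySparse(new int[] {2, 7, 40},
                                        new double[] {1.5, -3.0, 8.0});
        MapSparse m = new MapSparse();
        m.put(2, 1.5);
        m.put(7, -3.0);
        m.put(40, 8.0);
        System.out.println(a.get(7) + " " + m.get(7));
        System.out.println(a.get(5) + " " + m.get(5));
    }
}
```

Both variants expose the same get(int) call, which is the point made in the thread: the storage can change while the API stays the same. Which one is faster would still have to be benchmarked.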

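The thread leaves open how the mapping from discrete attribute values to doubles would be maintained. One possible shape, as a sketch only: a per-attribute dictionary kept outside the instance (for example in the feature extraction step), which assigns codes as labels are first seen and can report how many distinct values exist. NominalMappingSketch and its methods are hypothetical names, not part of SAMOA, and the "first seen gets the next code" policy is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NominalMappingSketch {
    // label -> code, and code -> label (the list position is the code)
    private final Map<String, Integer> codes = new HashMap<>();
    private final List<String> labels = new ArrayList<>();

    /** Returns the double code for a label, assigning a new one if unseen. */
    public double encode(String label) {
        Integer code = codes.get(label);
        if (code == null) {
            code = labels.size(); // next free code
            codes.put(label, code);
            labels.add(label);
        }
        return code.doubleValue();
    }

    /** Reverse lookup; throws IndexOutOfBoundsException for unknown codes. */
    public String decode(double value) {
        return labels.get((int) value);
    }

    /** Number of distinct values seen so far for this attribute. */
    public int numValues() {
        return labels.size();
    }

    public static void main(String[] args) {
        NominalMappingSketch classAttribute = new NominalMappingSketch();
        System.out.println(classAttribute.encode("+"));
        System.out.println(classAttribute.encode("-"));
        System.out.println(classAttribute.numValues());
    }
}
```

A learner that needs to know whether an attribute is discrete, and how many values it has, could query such an object from the topology instead of reading a header stored in each instance. New labels arriving mid-stream simply extend the mapping.
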
Re: New Instances

Posted by Matthieu Morel <mm...@apache.org>.

Hi, I won't be doing updates at the moment.

Note that this won't merge directly. One reason, if I remember
correctly, is that, in contrast to the existing implementation, the
instances in this proposal do not keep the whole history. It's really
intended for streaming.

Matthieu
