Posted to dev@mahout.apache.org by Karl Wettin <ka...@gmail.com> on 2008/03/01 22:55:28 UTC

Re: [jira] Updated: (MAHOUT-8) Data definition model

Sorry for killing the conversation with my last comment. I'll try again.

There is all this really nice code in SVN. That's great. But I really
want a low-level, random-access "input stream" of instances and a generic
header definition, so that anything we do has some common ground.
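
As a rough illustration of what such a layer might look like, here is a
minimal Java sketch of a shared header plus random access to instances.
Every name in it (AttributeHeader, InstanceReader, InMemoryInstanceReader)
is hypothetical; none of them come from Mahout, the JSRs, or the
pseudo_jsr.txt attachment, and the in-memory store only stands in for a
real seekable source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout; nothing here is an existing Mahout or JSR API.

/** Describes one column: its name and whether its values are numeric or discrete. */
final class AttributeHeader {
    enum Kind { NUMERIC, DISCRETE }

    final String name;
    final Kind kind;

    AttributeHeader(String name, Kind kind) {
        this.name = name;
        this.kind = kind;
    }
}

/** Random access over instances that all share one header. */
interface InstanceReader {
    List<AttributeHeader> header();

    long size();

    /** Returns the attribute values of the instance at the given position. */
    Object[] read(long index);
}

/** In-memory stand-in for a seekable source such as a file. */
final class InMemoryInstanceReader implements InstanceReader {
    private final List<AttributeHeader> header;
    private final List<Object[]> rows;

    InMemoryInstanceReader(List<AttributeHeader> header, List<Object[]> rows) {
        // Every record must have exactly one value per header attribute.
        for (Object[] row : rows) {
            if (row.length != header.size()) {
                throw new IllegalArgumentException("Row width does not match header");
            }
        }
        this.header = Collections.unmodifiableList(new ArrayList<>(header));
        this.rows = new ArrayList<>(rows);
    }

    @Override
    public List<AttributeHeader> header() {
        return header;
    }

    @Override
    public long size() {
        return rows.size();
    }

    @Override
    public Object[] read(long index) {
        if (index < 0 || index >= rows.size()) {
            throw new IndexOutOfBoundsException("No instance at index " + index);
        }
        // Hand out a copy so callers cannot modify the stored record.
        return rows.get((int) index).clone();
    }
}
```

An algorithm written against InstanceReader would only see the header and
the indexed records, not how or where the records are stored.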

In the long run, I believe this is vital for the project. Even more
important than having lots and lots of algorithms. And that is why I
stay in this layer.

If the JSRs and the code I posted in the JIRA are overkill, I'm OK with
that. I'm still quite attracted to the JSRs, and I'm going to stay in this
layer until someone tells me I'm nothing but the devil's advocate.



     karl


Ted Dunning wrote:
> 
> The thing that brings me up short when reading things like this JSR is that
> they have a LOT of mechanism here to explain something that is pretty simple
> in a language like R with the data.frame object.
> 
> I am left with the question of what is going on with the complexity.  Some
> explanations that I could imagine include:
> 
> A) the complexity is optional and R has a simpler solution
> 
> B) the language of discourse is somehow evil and R is just as complex, but
> it is somehow vastly easier to explain an R data.frame than it is to explain
> what the JSR is talking about.
> 
> C) Java itself is somehow at fault and it is forcing complexity on the
> problem that isn't necessary
> 
> D) I am clueless and R lacks the complexity, the JSR has it but it is all
> necessary.
> 
> My gut says that (a) is the right answer.  My ego causes me to discount (d).
> My religion causes me to discount (mostly) (c).  I would find it hard to
> argue why (b) is not true.
> 
> Anybody else have an opinion?
> 
> On 2/28/08 11:40 AM, "Karl Wettin" <ka...@gmail.com> wrote:
> 
>> The simplest way to explain this is to say it is the data headers. Here
>> is a simple example:
>>
>> Say column 1 is a numeric value and column 2 is a class value. Some
>> algorithms might only accept discrete values, but I know for a fact that
>> the numeric value is an integer between 1 and 10 and could thus be
>> treated as a discrete value even though it is not declared as one. I
>> don't want to go messing about in the headers of the physical data set,
>> nor do I want to transform the complete data set. Instead, I remap the
>> first column and state that it is a discrete value in my logical data set.
>>
>> The data definition model does not care whether the attribute values are
>> integers, strings or whatnot. They are all objects, and they can be
>> transformed to the type they were mapped as. And the separation of layers
>> makes it very simple.
>>
>> But most important, I really want to see a unified data model definition
>> (typed instance headers) and a very simple abstract way to access
>> physical data records (the instances) that we can share between data
>> transformation suites, ML algorithms, feature selectors and whatnot.
>>
>> Do you read UML? If so, the JSRs have some great documentation.
>>
>> There is more to it than what I tried to explain here. And I probably
>> didn't pick up half of the ideas behind the JSR data models.
>>
>>
>>      karl
>>
>> Grant Ingersoll wrote:
>>> I haven't looked at the JSRs.  Can you explain the use cases a bit more?
>>> How would it be used in M/R, and in implementations?  I like the sound
>>> of it.
>>>
>>>
>>> On Feb 25, 2008, at 4:34 PM, Karl Wettin (JIRA) wrote:
>>>
>>>>     [ https://issues.apache.org/jira/browse/MAHOUT-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>>
>>>>
>>>> Karl Wettin updated MAHOUT-8:
>>>> -----------------------------
>>>>
>>>>    Attachment: pseudo_jsr.txt
>>>>
>>>> My question is, did anyone else take a closer look at the JSRs? I
>>>> would very much like to hear what you people think of this data model.
>>>> I'm quite attracted to it.
>>>>
>>>> It says nothing about how data is stored; it is about roles and
>>>> abstract access to physical instance data. And it separates the logical
>>>> model (the data set definition used by ML algorithms) from the physical
>>>> model (the data set definition describing the source data), allowing one
>>>> to virtually transform the data set by mapping logical data to the
>>>> physical data in any way without messing things up.
>>>>
>>>> I now have a half-baked pseudo implementation of this. It uses
>>>> abstract classes rather than interfaces, and some of the interfaces
>>>> have been merged into a single class. It would, however, not be a big
>>>> deal to have it implement the interfaces if one wishes. I feel some of
>>>> the stuff in there is a bit overkill at this point, but I tried to follow
>>>> the specs as well as I could (I replaced a few ad hoc enum classes
>>>> with enums, etc.).
>>>>
>>>> There is no documentation, tests or anything concrete, just a bunch of
>>>> classes I'm now popping in the JIRA to show what it could look like.
>>>>
>>>> Actually, there is an early attempt at an abstract seekable physical
>>>> data record reader. And an ARFF writer. They are sort of my dry coded
>>>> thoughts. You can ignore them.
>>>>
>>>>
>>>>> Data definition model
>>>>> ---------------------
>>>>>
>>>>>                Key: MAHOUT-8
>>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-8
>>>>>            Project: Mahout
>>>>>         Issue Type: New Feature
>>>>>           Reporter: Karl Wettin
>>>>>        Attachments: pseudo_jsr.txt
>>>>>
>>>>>
>>>>> How do we define classes, attributes and instance data?
>>>>> This has nothing to do with physical data records, this is about data
>>>>> types, roles, etc.
>>>> -- 
>>>> This message is automatically generated by JIRA.
>>>> -
>>>> You can reply to this email to add a comment to the issue online.
>>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucenebootcamp.com
>>> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>>
>>>
>>
> 
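
The remapping example from the quoted 2/28 message (a numeric column known
to hold integers from 1 to 10, exposed as a discrete value in the logical
data set while the physical data stays untouched) could be sketched roughly
as below. This is only an illustration under assumptions: the names
AttributeType, LogicalAttribute and Remapping are hypothetical and are not
taken from the JSRs or from pseudo_jsr.txt.

```java
import java.util.function.Function;

// Hypothetical sketch; these names are illustrative, not from the JSRs or Mahout.

enum AttributeType { NUMERIC, DISCRETE }

/** A logical attribute: a typed view onto one column of a physical record. */
final class LogicalAttribute {
    final String name;
    final AttributeType type;
    final int physicalColumn;
    private final Function<Object, Object> converter;

    LogicalAttribute(String name, AttributeType type, int physicalColumn,
                     Function<Object, Object> converter) {
        this.name = name;
        this.type = type;
        this.physicalColumn = physicalColumn;
        this.converter = converter;
    }

    /** Reads the mapped column and converts it on access; the record is not modified. */
    Object valueOf(Object[] physicalRecord) {
        return converter.apply(physicalRecord[physicalColumn]);
    }
}

final class Remapping {
    private Remapping() {
    }

    /** Exposes a numeric column known to hold integers 1..10 as a discrete attribute. */
    static LogicalAttribute discreteOneToTen(String name, int physicalColumn) {
        return new LogicalAttribute(name, AttributeType.DISCRETE, physicalColumn, raw -> {
            double d = ((Number) raw).doubleValue();
            int i = (int) d;
            // Reject anything that is not a whole number in the expected range.
            if (i != d || i < 1 || i > 10) {
                throw new IllegalArgumentException("Not an integer in 1..10: " + raw);
            }
            return Integer.toString(i); // the discrete label
        });
    }
}
```

Because the conversion happens only when a value is read through the
logical attribute, neither the physical header nor the stored records need
to change, which is the point of keeping the two layers separate.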



Re: [jira] Updated: (MAHOUT-8) Data definition model

Posted by Grant Ingersoll <gs...@apache.org>.
Hey Karl,

I recall that when we first started this, we all seemed to agree that
trying to unify things too early is not time well spent and that we
should focus on algorithms and not on building a framework for
commonality across all the algorithms.  As I recall, our general
sentiment was that it would bring a sense of a lowest common
denominator to the project, since each algorithm may not be best suited
for the framework.

The takeaway for me is that it is too early to know what is
going to be needed across the algorithms we are implementing.  My gut
says, let's focus on building algorithms up through, say, release 0.5, and
then we can think about what gets us to 1.0, which probably will
involve what you are talking about here, but it may not.  Otherwise, I
think it will be a bit of the cart before the horse, in that we will
be creating infrastructure for things that we aren't sure will need
that infrastructure (granted, this stuff is probably needed by most,
but that is to be determined).

-Grant


--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ