Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/11/29 05:25:11 UTC

Data class taxonomy for machine learning

Is this a fair breakdown of data classes?

http://smlv.cc.gatech.edu/2010/03/23/a-taxonomy-of-data-types/

(btw everything tagged DAVA is interesting)

-- 
Lance Norskog
goksron@gmail.com

Re: Data class taxonomy for machine learning

Posted by Ted Dunning <te...@gmail.com>.
Join the lines together.

On Wed, Nov 30, 2011 at 8:45 PM, Lance Norskog <go...@gmail.com> wrote:

> Oops, the other one:
> Datenaufbereitung.pdf<
> http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf
> >does
> not work.
>
> On Wed, Nov 30, 2011 at 8:41 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > It is not spelled that way in german.  Use an s near the end of the word.
> >
> > Other than that, I can't imagine the problem.  The link worked for me
> > earlier today and just now as well.
> >
> > On Wed, Nov 30, 2011 at 7:20 PM, Lance Norskog <go...@gmail.com>
> wrote:
> >
> > > Problemanalyze.pdf is not there.
> > >
> > > On Wed, Nov 30, 2011 at 1:14 PM, Isabel Drost <is...@apache.org>
> wrote:
> > >
> > > > On 29.11.2011 Ted Dunning wrote:
> > > > > I find this taxonomy excessive and over-done.  The distinctions I
> > find
> > > > > useful include
> > > > >
> > > > > - continuous variables
> > > > >
> > > > > - discrete variables with a known set of values (I call these
> > > > categorical,
> > > > > usually).  This includes ordinal variables since ordering rarely
> > makes
> > > a
> > > > > lot of difference.
> > > > >
> > > > > - discrete variables with a large or not well known set of possible
> > > > values
> > > > > (I call these "word-like")
> > > > >
> > > > > - bags or lists of word-like variables (I call these text-like)
> > > >
> > > > What I found useful for explaining which data types to expect::
> > > >
> > > > http://www.cs.uni-
> > > >
> > >
> >
> potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf(Slide
> <
> http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf%28Slide
> >
> > > > 6, unfortunately in German only)
> > > >
> > > > What seemed more needed was an explanation of different problem
> > settings
> > > > and how
> > > > to tackle them on a very high level:
> > > > http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Problemanalyse.pdf
> > > >
> > > >
> > > > Isabel
> > > >
> > >
> > >
> > >
> > > --
> > > Lance Norskog
> > > goksron@gmail.com
> > >
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Data class taxonomy for machine learning

Posted by Lance Norskog <go...@gmail.com>.
The problem was that when the full link line broke, the remainder started
with potsdam.de and was a real link.

On Thu, Dec 1, 2011 at 12:41 AM, Manuel Blechschmidt <
Manuel.Blechschmidt@gmx.de> wrote:

> Hi Lance,
>
> On 01.12.2011, at 05:45, Lance Norskog wrote:
>
> > Oops, the other one:
> > Datenaufbereitung.pdf<
> http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf
> >does
> > not work.
>
> This is the correct one:
>
> http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf
>
> >
> > On Wed, Nov 30, 2011 at 8:41 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> It is not spelled that way in german.  Use an s near the end of the
> word.
> >>
> >> Other than that, I can't imagine the problem.  The link worked for me
> >> earlier today and just now as well.
> >>
> >> On Wed, Nov 30, 2011 at 7:20 PM, Lance Norskog <go...@gmail.com>
> wrote:
> >>
> >>> Problemanalyze.pdf is not there.
> >>>
> >>> On Wed, Nov 30, 2011 at 1:14 PM, Isabel Drost <is...@apache.org>
> wrote:
> >>>
> >>>> On 29.11.2011 Ted Dunning wrote:
> >>>>> I find this taxonomy excessive and over-done.  The distinctions I
> >> find
> >>>>> useful include
> >>>>>
> >>>>> - continuous variables
> >>>>>
> >>>>> - discrete variables with a known set of values (I call these
> >>>> categorical,
> >>>>> usually).  This includes ordinal variables since ordering rarely
> >> makes
> >>> a
> >>>>> lot of difference.
> >>>>>
> >>>>> - discrete variables with a large or not well known set of possible
> >>>> values
> >>>>> (I call these "word-like")
> >>>>>
> >>>>> - bags or lists of word-like variables (I call these text-like)
> >>>>
> >>>> What I found useful for explaining which data types to expect::
> >>>>
> >>>> http://www.cs.uni-
> >>>>
> >>>
> >>
> potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf(Slide<http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf%28Slide>
> <
> http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf%28Slide
> >
> >>>> 6, unfortunately in German only)
> >>>>
> >>>> What seemed more needed was an explanation of different problem
> >> settings
> >>>> and how
> >>>> to tackle them on a very high level:
> >>>> http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Problemanalyse.pdf
> >>>>
> >>>>
> >>>> Isabel
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Lance Norskog
> >>> goksron@gmail.com
> >>>
> >>
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
>
> --
> Manuel Blechschmidt
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
>
>


-- 
Lance Norskog
goksron@gmail.com

Re: Data class taxonomy for machine learning

Posted by Manuel Blechschmidt <Ma...@gmx.de>.
Hi Lance,

On 01.12.2011, at 05:45, Lance Norskog wrote:

> Oops, the other one:
> Datenaufbereitung.pdf<http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf>does
> not work.

This is the correct one:
http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf

> 
> On Wed, Nov 30, 2011 at 8:41 PM, Ted Dunning <te...@gmail.com> wrote:
> 
>> It is not spelled that way in german.  Use an s near the end of the word.
>> 
>> Other than that, I can't imagine the problem.  The link worked for me
>> earlier today and just now as well.
>> 
>> On Wed, Nov 30, 2011 at 7:20 PM, Lance Norskog <go...@gmail.com> wrote:
>> 
>>> Problemanalyze.pdf is not there.
>>> 
>>> On Wed, Nov 30, 2011 at 1:14 PM, Isabel Drost <is...@apache.org> wrote:
>>> 
>>>> On 29.11.2011 Ted Dunning wrote:
>>>>> I find this taxonomy excessive and over-done.  The distinctions I
>> find
>>>>> useful include
>>>>> 
>>>>> - continuous variables
>>>>> 
>>>>> - discrete variables with a known set of values (I call these
>>>> categorical,
>>>>> usually).  This includes ordinal variables since ordering rarely
>> makes
>>> a
>>>>> lot of difference.
>>>>> 
>>>>> - discrete variables with a large or not well known set of possible
>>>> values
>>>>> (I call these "word-like")
>>>>> 
>>>>> - bags or lists of word-like variables (I call these text-like)
>>>> 
>>>> What I found useful for explaining which data types to expect::
>>>> 
>>>> http://www.cs.uni-
>>>> 
>>> 
>> potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf(Slide<http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf%28Slide>
>>>> 6, unfortunately in German only)
>>>> 
>>>> What seemed more needed was an explanation of different problem
>> settings
>>>> and how
>>>> to tackle them on a very high level:
>>>> http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Problemanalyse.pdf
>>>> 
>>>> 
>>>> Isabel
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com

-- 
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B


Re: Data class taxonomy for machine learning

Posted by Lance Norskog <go...@gmail.com>.
Oops, the other one:
Datenaufbereitung.pdf <http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf>
does not work.

On Wed, Nov 30, 2011 at 8:41 PM, Ted Dunning <te...@gmail.com> wrote:

> It is not spelled that way in german.  Use an s near the end of the word.
>
> Other than that, I can't imagine the problem.  The link worked for me
> earlier today and just now as well.
>
> On Wed, Nov 30, 2011 at 7:20 PM, Lance Norskog <go...@gmail.com> wrote:
>
> > Problemanalyze.pdf is not there.
> >
> > On Wed, Nov 30, 2011 at 1:14 PM, Isabel Drost <is...@apache.org> wrote:
> >
> > > On 29.11.2011 Ted Dunning wrote:
> > > > I find this taxonomy excessive and over-done.  The distinctions I
> find
> > > > useful include
> > > >
> > > > - continuous variables
> > > >
> > > > - discrete variables with a known set of values (I call these
> > > categorical,
> > > > usually).  This includes ordinal variables since ordering rarely
> makes
> > a
> > > > lot of difference.
> > > >
> > > > - discrete variables with a large or not well known set of possible
> > > values
> > > > (I call these "word-like")
> > > >
> > > > - bags or lists of word-like variables (I call these text-like)
> > >
> > > What I found useful for explaining which data types to expect::
> > >
> > > http://www.cs.uni-
> > >
> >
> potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf(Slide<http://potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf%28Slide>
> > > 6, unfortunately in German only)
> > >
> > > What seemed more needed was an explanation of different problem
> settings
> > > and how
> > > to tackle them on a very high level:
> > > http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Problemanalyse.pdf
> > >
> > >
> > > Isabel
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>



-- 
Lance Norskog
goksron@gmail.com

Re: Data class taxonomy for machine learning

Posted by Ted Dunning <te...@gmail.com>.
It is not spelled that way in German.  Use an s near the end of the word.

Other than that, I can't imagine the problem.  The link worked for me
earlier today and just now as well.

On Wed, Nov 30, 2011 at 7:20 PM, Lance Norskog <go...@gmail.com> wrote:

> Problemanalyze.pdf is not there.
>
> On Wed, Nov 30, 2011 at 1:14 PM, Isabel Drost <is...@apache.org> wrote:
>
> > On 29.11.2011 Ted Dunning wrote:
> > > I find this taxonomy excessive and over-done.  The distinctions I find
> > > useful include
> > >
> > > - continuous variables
> > >
> > > - discrete variables with a known set of values (I call these
> > categorical,
> > > usually).  This includes ordinal variables since ordering rarely makes
> a
> > > lot of difference.
> > >
> > > - discrete variables with a large or not well known set of possible
> > values
> > > (I call these "word-like")
> > >
> > > - bags or lists of word-like variables (I call these text-like)
> >
> > What I found useful for explaining which data types to expect::
> >
> > http://www.cs.uni-
> >
> potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf(Slide
> > 6, unfortunately in German only)
> >
> > What seemed more needed was an explanation of different problem settings
> > and how
> > to tackle them on a very high level:
> > http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Problemanalyse.pdf
> >
> >
> > Isabel
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Data class taxonomy for machine learning

Posted by Lance Norskog <go...@gmail.com>.
Problemanalyze.pdf is not there.

On Wed, Nov 30, 2011 at 1:14 PM, Isabel Drost <is...@apache.org> wrote:

> On 29.11.2011 Ted Dunning wrote:
> > I find this taxonomy excessive and over-done.  The distinctions I find
> > useful include
> >
> > - continuous variables
> >
> > - discrete variables with a known set of values (I call these
> categorical,
> > usually).  This includes ordinal variables since ordering rarely makes a
> > lot of difference.
> >
> > - discrete variables with a large or not well known set of possible
> values
> > (I call these "word-like")
> >
> > - bags or lists of word-like variables (I call these text-like)
>
> What I found useful for explaining which data types to expect::
>
> http://www.cs.uni-
> potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf(Slide
> 6, unfortunately in German only)
>
> What seemed more needed was an explanation of different problem settings
> and how
> to tackle them on a very high level:
> http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Problemanalyse.pdf
>
>
> Isabel
>



-- 
Lance Norskog
goksron@gmail.com

Re: Data class taxonomy for machine learning

Posted by Isabel Drost <is...@apache.org>.
On 29.11.2011 Ted Dunning wrote:
> I find this taxonomy excessive and over-done.  The distinctions I find
> useful include
> 
> - continuous variables
> 
> - discrete variables with a known set of values (I call these categorical,
> usually).  This includes ordinal variables since ordering rarely makes a
> lot of difference.
> 
> - discrete variables with a large or not well known set of possible values
> (I call these "word-like")
> 
> - bags or lists of word-like variables (I call these text-like)

What I found useful for explaining which data types to expect:

http://www.cs.uni-
potsdam.de/ml/teaching/ws10/ida/Datenselektion_und_Datenaufbereitung.pdf (Slide 
6, unfortunately in German only) 

What seemed more needed was an explanation of different problem settings and how 
to tackle them on a very high level:
http://www.cs.uni-potsdam.de/ml/teaching/ws10/ida/Problemanalyse.pdf


Isabel

Re: Data class taxonomy for machine learning

Posted by Ted Dunning <te...@gmail.com>.
I find this taxonomy excessive and over-done.  The distinctions I find
useful include

- continuous variables

- discrete variables with a known set of values (I call these categorical,
usually).  This includes ordinal variables since ordering rarely makes a
lot of difference.

- discrete variables with a large or not well known set of possible values
(I call these "word-like")

- bags or lists of word-like variables (I call these text-like)

Occasionally, I also use

- bags of (word, time, amount) triples where time and amount are continuous
variables.  I call these transactions.

Most of the rest is fluff.  You might want it for algebraic completeness
and ability to describe absolutely everything, but it really doesn't much
matter for practical purposes.
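
A minimal sketch of how the four main distinctions above could be told apart
in code. This is an illustration only, not something from Mahout: the names
Kind and classify and the max_categories threshold are hypothetical, and it
assumes each column holds values of one homogeneous, hashable type (or lists
of tokens).

```python
from enum import Enum


class Kind(Enum):
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"
    WORD_LIKE = "word-like"
    TEXT_LIKE = "text-like"


def classify(values, max_categories=20):
    """Guess the kind of a column from a sample of its values.

    max_categories is an assumed cut-off between a "known set of values"
    and a "large or not well known" set; tune it for real data.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    # Bags or lists of word-like tokens are text-like.
    if all(isinstance(v, (list, tuple, set, frozenset)) for v in values):
        return Kind.TEXT_LIKE
    # Floating-point values are treated as continuous.
    if all(isinstance(v, float) for v in values):
        return Kind.CONTINUOUS
    # Few distinct values: categorical (ordinal values land here too).
    if len(set(values)) <= max_categories:
        return Kind.CATEGORICAL
    # Many distinct values: word-like.
    return Kind.WORD_LIKE


print(classify([1.5, 2.0, 3.7]))
print(classify(["red", "green", "red"]))
print(classify([["a", "b"], ["c"]]))
```

Integer ratings fall into the categorical bucket here, which matches the
point that ordering rarely makes a lot of difference.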

On Tue, Nov 29, 2011 at 12:08 AM, Konstantin Shmakov <ks...@gmail.com> wrote:

> It is missing a definition of "atom" (at least on the page referred to); is it
> the basic piece of information?
>
> It also seems that "numeric" is continuous (temperature, financial data) and
> "categoric" and "ordinal" are discrete (words, ratings).
>
> As such all these data types will be more naturally categorized along 3
> dimensions:
> - continuous, discrete
> - ordered, unordered
> - data dimensionality (1d, 2d, 3d)
>
> --
>
>
> On Mon, Nov 28, 2011 at 8:25 PM, Lance Norskog <go...@gmail.com> wrote:
>
> > Is this a fair breakdown of data classes?
> >
> > http://smlv.cc.gatech.edu/2010/03/23/a-taxonomy-of-data-types/
> >
> > (btw everything tagged DAVA is interesting)
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>
>
>
> --
> ksh:
>

Re: Data class taxonomy for machine learning

Posted by Konstantin Shmakov <ks...@gmail.com>.
It is missing a definition of "atom" (at least on the page referred to); is it
the basic piece of information?

It also seems that "numeric" is continuous (temperature, financial data) and
"categoric" and "ordinal" are discrete (words, ratings).

As such all these data types will be more naturally categorized along 3
dimensions:
- continuous, discrete
- ordered, unordered
- data dimensionality (1d, 2d, 3d)
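
As a rough sketch, the three dimensions above could be recorded as three
independent fields. The names DataType, EXAMPLES and describe are made up for
illustration, and the dims=1 values are an assumption; only the
continuous/discrete labels for temperature, words and ratings come from the
text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataType:
    continuous: bool  # True = continuous, False = discrete
    ordered: bool     # True = ordered, False = unordered
    dims: int         # data dimensionality: 1, 2 or 3


# Placement of each example along the three axes (dims is assumed).
EXAMPLES = {
    "temperature": DataType(continuous=True, ordered=True, dims=1),
    "rating": DataType(continuous=False, ordered=True, dims=1),
    "word": DataType(continuous=False, ordered=False, dims=1),
}


def describe(t: DataType) -> str:
    """Render a DataType as a short label along the three axes."""
    kind = "continuous" if t.continuous else "discrete"
    order = "ordered" if t.ordered else "unordered"
    return f"{kind}, {order}, {t.dims}d"


for name, t in EXAMPLES.items():
    print(name, "->", describe(t))
```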

--


On Mon, Nov 28, 2011 at 8:25 PM, Lance Norskog <go...@gmail.com> wrote:

> Is this a fair breakdown of data classes?
>
> http://smlv.cc.gatech.edu/2010/03/23/a-taxonomy-of-data-types/
>
> (btw everything tagged DAVA is interesting)
>
> --
> Lance Norskog
> goksron@gmail.com
>



-- 
ksh: