Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/04/02 22:08:36 UTC

Data frames

Are the Spark efforts supporting all Mahout Vector types? Named, Property Vectors? It occurred to me that data frames in R are a related but more general solution. If all rows and columns of a DRM and their corresponding Vectors (row or column vectors) were to support arbitrary properties attached to them in such a way that they are preserved during transpose, Vector extraction, and any other operations that make sense, there would be a huge benefit for users.

One of the constant problems with input to Mahout is translation of IDs. External to Mahout going in, Mahout to external coming out. Most of this would be unneeded if Mahout supported data frames, some would be avoided by supporting named or property vectors universally.
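
To make the problem concrete, here is a minimal sketch (plain Scala with made-up ids, not a Mahout API) of the bookkeeping this forces on users today: external string ids are mapped to int row keys on the way in and mapped back on the way out.

object IdRoundTrip {
  def main(args: Array[String]): Unit = {
    // hypothetical external ids for the rows of a user x item matrix
    val externalIds = Seq("user-pat", "user-ted", "user-dmitriy")

    // going in: external id -> Mahout Int row key
    val toMahout: Map[String, Int] = externalIds.zipWithIndex.toMap

    // coming out: Mahout Int row key -> external id
    val fromMahout: Map[Int, String] = toMahout.map(_.swap)

    // every Int-keyed result has to be re-labelled before it is usable downstream
    val resultRowKey = 2
    println(s"row $resultRowKey is ${fromMahout(resultRowKey)}")
  }
}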


Re: Data frames

Posted by Ted Dunning <te...@gmail.com>.
And the feature request should be phrased in terms of code with desired
behavior.
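
For instance, the desired behavior might be sketched roughly like this (purely hypothetical: drmFromSeq and rowIds are made-up names standing in for the requested behavior, not anything Mahout provides today; dvec is just a dense vector literal):

// desired: external ids ride along as row keys through the whole pipeline
val drmA = drmFromSeq(Seq(
  "user-pat"     -> dvec(1.0, 0.0, 3.0),
  "user-ted"     -> dvec(0.0, 2.0, 0.0),
  "user-dmitriy" -> dvec(4.0, 0.0, 1.0)))

val drmSim = drmA %*% drmA.t   // a co-occurrence-style product
drmSim.rowIds                  // == Seq("user-pat", "user-ted", "user-dmitriy"),
                               // no int <-> string translation step anywhere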




On Thu, Apr 3, 2014 at 8:00 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Perhaps this is best phrased as a feature request.
>
> On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> PS.
>
> Sequence file keys also have special meaning if they are Ints. E.g. the A'
> physical operator requires keys to be Ints, in which case it interprets
> them as row indexes that become column indexes. This of course isn't always
> the case, e.g. (Aexpr).t %*% Aexpr doesn't require Int indices because in
> reality the optimizer will never choose an actual transposition as a physical
> step in such a pipeline. This interpretation is consistent with the
> interpretation of the long-existing Hadoop-side DistributedRowMatrix#transpose.
>
>
> On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> >
> >
> >
> > On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> >
> >>
> >>> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >>>
> >>> I think this duality, names and keys, is not very healthy really, and
> >> just
> >>> creates additional hassle. Spark drm takes care of keys automatically
> >>> throughout, but propagating names from named vectors is solely an algorithm
> >>> concern as it stands.
> >>
> >> Not sure what you mean.
> >
> > Not what you think, it looks like.
> >
> > I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
> > persisted, key goes to the key of a sequence file. In particular, it
> means
> > that there is a case of Bag[ key -> NamedVector]. Which means, external
> > anchor could be saved to either key or name of a row. In practice it
> causes
> > a compatibility mess, e.g. we saw those numerous cases where e.g.
> seq2sparse
> > saves external keys (file paths) into the key, whereas e.g. clustering
> > algorithms are not seeing them because they expect them to be the name
> part
> > of the vector. I am just saying we have two ways to name the rows, and it
> > is generally not a healthy choice for the aforementioned reason.
> >
> >
> >> In my experience Names and Properties are primarily used to store
> >> external keys, which are quite healthy.
> >
> > Users never have data with Mahout keys, they must constantly go back and
> >> forth. This is exactly what the R data frame does, no? I'm not so
> concerned
> >> with being able to address an element by the external key
> >> drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have
> the
> >> external ids follow the data through any calculation that makes sense.
> >>
> >
> > I am with you on this.
> >
> >
> >> This would mean clustering, recommendations, transpose, RSJ would
> require
> >> no id transforming steps. This would make dealing with Mahout much
> easier.
> >>
> >
> > Data frames are a bit of a different thing; right now we work just
> with
> > matrices. Although, yes, our in-core matrices support row and column
> names
> > (just like in R) and distributed matrices support row keys only. What I
> > mean is that an algebraic expression, e.g.
> >
> > Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
> > above, but not necessarily named vectors, because internally algorithms
> > blockify things into matrix blocks, and I am far from sure that Mahout
> > in-core stuff works correctly with named vectors as part of a matrix
> block
> > in all situations. I may be wrong. I always relied on sequence file keys
> to
> > identify data points.
> >
> > Note that sequence file keys are bigger than just a name, it is anything
> > Writable. I.e. you could save a data structure there, as long as you
> have a
> > Writable for it.
> >
> >
> >>> On Apr 2, 2014 1:08 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
> >>>
> >>>> Are the Spark efforts supporting all Mahout Vector types? Named,
> >> Property
> >>> Vectors? It occurred to me that data frames in R are a related but more
> >>>> general solution. If all rows and columns of a DRM and their
> >> corresponding
> >>>> Vectors (row or column vectors) were to support arbitrary properties
> >>>> attached to them in such a way that they are preserved during
> >> transpose,
> >>>> Vector extraction, and any other operations that make sense, there
> >> would be
> >>>> a huge benefit for users.
> >>>>
> >>>> One of the constant problems with input to Mahout is translation of
> >> IDs.
> >>>> External to Mahout going in, Mahout to external coming out. Most of
> >> this
> >>>> would be unneeded if Mahout supported data frames, some would be
> >> avoided by
> >>>> supporting named or property vectors universally.
> >>>>
> >>>>
> >>>
> >>
> >
> >
>
>

Re: Data frames

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Perhaps this is best phrased as a feature request.

On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

PS.

Sequence file keys also have special meaning if they are Ints. E.g. the A'
physical operator requires keys to be Ints, in which case it interprets
them as row indexes that become column indexes. This of course isn't always
the case, e.g. (Aexpr).t %*% Aexpr doesn't require Int indices because in
reality the optimizer will never choose an actual transposition as a physical
step in such a pipeline. This interpretation is consistent with the
interpretation of the long-existing Hadoop-side DistributedRowMatrix#transpose.


On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> 
> 
> 
> On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> 
>>> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> 
>>> I think this duality, names and keys, is not very healthy really, and
>> just
>>> creates additional hassle. Spark drm takes care of keys automatically
>>> throughout, but propagating names from named vectors is solely an algorithm
>>> concern as it stands.
>> 
>> Not sure what you mean.
> 
> Not what you think, it looks like.
> 
> I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
> persisted, key goes to the key of a sequence file. In particular, it means
> that there is a case of Bag[ key -> NamedVector]. Which means, external
> anchor could be saved to either key or name of a row. In practice it causes
> a compatibility mess, e.g. we saw those numerous cases where e.g. seq2sparse
> saves external keys (file paths) into the key, whereas e.g. clustering
> algorithms are not seeing them because they expect them to be the name part
> of the vector. I am just saying we have two ways to name the rows, and it
> is generally not a healthy choice for the aforementioned reason.
> 
> 
>> In my experience Names and Properties are primarily used to store
>> external keys, which are quite healthy.
> 
> Users never have data with Mahout keys, they must constantly go back and
>> forth. This is exactly what the R data frame does, no? I'm not so concerned
>> with being able to address an element by the external key
>> drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
>> external ids follow the data through any calculation that makes sense.
>> 
> 
> I am with you on this.
> 
> 
>> This would mean clustering, recommendations, transpose, RSJ would require
>> no id transforming steps. This would make dealing with Mahout much easier.
>> 
> 
> Data frames are a bit of a different thing; right now we work just with
> matrices. Although, yes, our in-core matrices support row and column names
> (just like in R) and distributed matrices support row keys only. What I
> mean is that an algebraic expression, e.g.
> 
> Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
> above, but not necessarily named vectors, because internally algorithms
> blockify things into matrix blocks, and I am far from sure that Mahout
> in-core stuff works correctly with named vectors as part of a matrix block
> in all situations. I may be wrong. I always relied on sequence file keys to
> identify data points.
> 
> Note that sequence file keys are bigger than just a name, it is anything
> Writable. I.e. you could save a data structure there, as long as you have a
> Writable for it.
> 
> 
>>> On Apr 2, 2014 1:08 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>>> 
>>>> Are the Spark efforts supporting all Mahout Vector types? Named,
>> Property
>>>> Vectors? It occurred to me that data frames in R are a related but more
>>>> general solution. If all rows and columns of a DRM and their
>> corresponding
>>>> Vectors (row or column vectors) were to support arbitrary properties
>>>> attached to them in such a way that they are preserved during
>> transpose,
>>>> Vector extraction, and any other operations that make sense, there
>> would be
>>>> a huge benefit for users.
>>>> 
>>>> One of the constant problems with input to Mahout is translation of
>> IDs.
>>>> External to Mahout going in, Mahout to external coming out. Most of
>> this
>>>> would be unneeded if Mahout supported data frames, some would be
>> avoided by
>>>> supporting named or property vectors universally.
>>>> 
>>>> 
>>> 
>> 
> 
> 


Re: Data frames

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS.

Sequence file keys also have special meaning if they are Ints. E.g. the A'
physical operator requires keys to be Ints, in which case it interprets
them as row indexes that become column indexes. This of course isn't always
the case, e.g. (Aexpr).t %*% Aexpr doesn't require Int indices because in
reality the optimizer will never choose an actual transposition as a physical
step in such a pipeline. This interpretation is consistent with the
interpretation of the long-existing Hadoop-side DistributedRowMatrix#transpose.
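
Roughly, in the Spark bindings DSL (a sketch from memory; the import paths, mahoutSparkContext, and drmParallelize details should be treated as assumptions):

import org.apache.mahout.math.scalabindings._     // dense()
import org.apache.mahout.math.drm._                // DrmLike, drmParallelize
import org.apache.mahout.math.drm.RLikeDrmOps._    // .t and %*% on DRMs
import org.apache.mahout.sparkbindings._           // mahoutSparkContext

implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "transpose-keys")

// Int-keyed DRM: row keys double as row indexes, so the physical A'
// operator is well defined: keys simply become column indexes.
val drmA: DrmLike[Int] = drmParallelize(dense((1.0, 2.0), (3.0, 4.0)))
val drmAt = drmA.t

// A.t %*% A is fine even when keys are not Ints: the optimizer computes
// A'A directly and never materializes the transpose as a physical step.
val drmAtA = drmA.t %*% drmA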


On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
>
>
> On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>>
>> > On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> >
>> > I think this duality, names and keys, is not very healthy really, and
>> just
>> > creates additional hassle. Spark drm takes care of keys automatically
>> > throughout, but propagating names from named vectors is solely an algorithm
>> > concern as it stands.
>>
>> Not sure what you mean.
>
> Not what you think, it looks like.
>
> I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
> persisted, key goes to the key of a sequence file. In particular, it means
> that there is a case of Bag[ key -> NamedVector]. Which means, external
> anchor could be saved to either key or name of a row. In practice it causes
> a compatibility mess, e.g. we saw those numerous cases where e.g. seq2sparse
> saves external keys (file paths) into the key, whereas e.g. clustering
> algorithms are not seeing them because they expect them to be the name part
> of the vector. I am just saying we have two ways to name the rows, and it
> is generally not a healthy choice for the aforementioned reason.
>
>
>> In my experience Names and Properties are primarily used to store
>> external keys, which are quite healthy.
>
>  Users never have data with Mahout keys, they must constantly go back and
>> forth. This is exactly what the R data frame does, no? I'm not so concerned
>> with being able to address an element by the external key
>> drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
>> external ids follow the data through any calculation that makes sense.
>>
>
> I am with you on this.
>
>
>> This would mean clustering, recommendations, transpose, RSJ would require
>> no id transforming steps. This would make dealing with Mahout much easier.
>>
>
> Data frames are a bit of a different thing; right now we work just with
> matrices. Although, yes, our in-core matrices support row and column names
> (just like in R) and distributed matrices support row keys only. What I
> mean is that an algebraic expression, e.g.
>
> Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
> above, but not necessarily named vectors, because internally algorithms
> blockify things into matrix blocks, and I am far from sure that Mahout
> in-core stuff works correctly with named vectors as part of a matrix block
> in all situations. I may be wrong. I always relied on sequence file keys to
> identify data points.
>
> Note that sequence file keys are bigger than just a name, it is anything
> Writable. I.e. you could save a data structure there, as long as you have a
> Writable for it.
>
>
>> > On Apr 2, 2014 1:08 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>> >
>> >> Are the Spark efforts supporting all Mahout Vector types? Named,
>> Property
>> >> Vectors? It occurred to me that data frames in R are a related but more
>> >> general solution. If all rows and columns of a DRM and their
>> corresponding
>> >> Vectors (row or column vectors) were to support arbitrary properties
>> >> attached to them in such a way that they are preserved during
>> transpose,
>> >> Vector extraction, and any other operations that make sense, there
>> would be
>> >> a huge benefit for users.
>> >>
>> >> One of the constant problems with input to Mahout is translation of
>> IDs.
>> >> External to Mahout going in, Mahout to external coming out. Most of
>> this
>> >> would be unneeded if Mahout supported data frames, some would be
>> avoided by
>> >> supporting named or property vectors universally.
>> >>
>> >>
>> >
>>
>
>

Re: Data frames

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

>
> > On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > I think this duality, names and keys, is not very healthy really, and
> just
> > creates additional hassle. Spark drm takes care of keys automatically
> > throughout, but propagating names from named vectors is solely an algorithm
> > concern as it stands.
>
> Not sure what you mean.

Not what you think, it looks like.

I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
persisted, key goes to the key of a sequence file. In particular, it means
that there is a case of Bag[ key -> NamedVector]. Which means, external
anchor could be saved to either key or name of a row. In practice it causes
a compatibility mess, e.g. we saw those numerous cases where e.g. seq2sparse
saves external keys (file paths) into the key, whereas e.g. clustering
algorithms are not seeing them because they expect them to be the name part
of the vector. I am just saying we have two ways to name the rows, and it
is generally not a healthy choice for the aforementioned reason.
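
Concretely (NamedVector, VectorWritable and Text are the real classes; the paths and tuples below only illustrate the two places an external anchor can end up):

import org.apache.hadoop.io.Text
import org.apache.mahout.math.{DenseVector, NamedVector, VectorWritable}

val v = new DenseVector(Array(1.0, 2.0, 3.0))

// variant 1: the external anchor lives in the sequence file key
// (this is what seq2sparse does with document paths)
val keyedRow = (new Text("/docs/part-00000/doc-42"), new VectorWritable(v))

// variant 2: the external anchor lives in the NamedVector's name,
// while the sequence file key holds something else
val namedRow = (new Text("42"),
  new VectorWritable(new NamedVector(v, "/docs/part-00000/doc-42")))

// a consumer that only looks in one of the two places never sees the anchor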


> In my experience Names and Properties are primarily used to store external
> keys, which are quite healthy.

Users never have data with Mahout keys, they must constantly go back and
> forth. This is exactly what the R data frame does, no? I'm not so concerned
> with being able to address an element by the external key
> drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
> external ids follow the data through any calculation that makes sense.
>

I am with you on this.


> This would mean clustering, recommendations, transpose, RSJ would require
> no id transforming steps. This would make dealing with Mahout much easier.
>

Data frames are a bit of a different thing; right now we work just with
matrices. Although, yes, our in-core matrices support row and column names
(just like in R) and distributed matrices support row keys only. What I
mean is that an algebraic expression, e.g.

Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
above, but not necessarily named vectors, because internally algorithms
blockify things into matrix blocks, and I am far from sure that Mahout
in-core stuff works correctly with named vectors as part of a matrix block
in all situations. I may be wrong. I always relied on sequence file keys to
identify data points.
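
For reference, the in-core row/column names mentioned above look roughly like this (the label bindings live on org.apache.mahout.math.Matrix; the exact method usage here is from memory, so treat the details as assumptions):

import org.apache.mahout.math.DenseMatrix
import scala.collection.JavaConverters._

val m = new DenseMatrix(2, 2)

// R-style row and column names via label bindings (label -> index)
m.setRowLabelBindings(Map("pat" -> Integer.valueOf(0), "ted" -> Integer.valueOf(1)).asJava)
m.setColumnLabelBindings(Map("iPad" -> Integer.valueOf(0), "iPhone" -> Integer.valueOf(1)).asJava)

m.set("pat", "iPad", 1.0)     // address a cell by external labels
val x = m.get("pat", "iPad")  // 1.0

// distributed matrices, by contrast, only carry row keys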

Note that sequence file keys are bigger than just a name, it is anything
Writable. I.e. you could save a data structure there, as long as you have a
Writable for it.


> > On Apr 2, 2014 1:08 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
> >
> >> Are the Spark efforts supporting all Mahout Vector types? Named,
> Property
> >> Vectors? It occurred to me that data frames in R are a related but more
> >> general solution. If all rows and columns of a DRM and their
> corresponding
> >> Vectors (row or column vectors) were to support arbitrary properties
> >> attached to them in such a way that they are preserved during transpose,
> >> Vector extraction, and any other operations that make sense, there would
> be
> >> a huge benefit for users.
> >>
> >> One of the constant problems with input to Mahout is translation of IDs.
> >> External to Mahout going in, Mahout to external coming out. Most of this
> >> would be unneeded if Mahout supported data frames, some would be
> avoided by
> >> supporting named or property vectors universally.
> >>
> >>
> >
>

Re: Data frames

Posted by Pat Ferrel <pa...@occamsmachete.com>.
> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> I think this duality, names and keys, is not very healthy really, and just
> creates additional hassle. Spark drm takes care of keys automatically
> throughout, but propagating names from named vectors is solely an algorithm
> concern as it stands.

Not sure what you mean. In my experience Names and Properties are primarily used to store external keys, which are quite healthy.
Users never have data with Mahout keys, they must constantly go back and forth. This is exactly what the R data frame does, no? I’m not so concerned with being able to address an element by the external key drmB[“pat”][“iPad”] like a HashMap. But it would sure be nice to have the external ids follow the data through any calculation that makes sense.

This would mean clustering, recommendations, transpose, RSJ would require no id transforming steps. This would make dealing with Mahout much easier.

> On Apr 2, 2014 1:08 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
> 
>> Are the Spark efforts supporting all Mahout Vector types? Named, Property
>> Vectors? It occurred to me that data frames in R are a related but more
>> general solution. If all rows and columns of a DRM and their corresponding
>> Vectors (row or column vectors) were to support arbitrary properties
>> attached to them in such a way that they are preserved during transpose,
>> Vector extraction, and any other operations that make sense, there would be
>> a huge benefit for users.
>> 
>> One of the constant problems with input to Mahout is translation of IDs.
>> External to Mahout going in, Mahout to external coming out. Most of this
>> would be unneeded if Mahout supported data frames, some would be avoided by
>> supporting named or property vectors universally.
>> 
>> 
> 

Re: Data frames

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Anything that is supported by VectorWritable and MatrixWritable. I am sure
named vectors are supported there; not sure about property vectors, I did
not use them.

However, dssvd does not currently take extra effort to propagate names from
A to U, for example. It only propagates row keys from A to U.

I think this duality, names and keys, is not very healthy really, and just
creates additional hassle. Spark drm takes care of keys automatically
throughout, but propagating names from named vectors is solely an algorithm
concern as it stands.
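
A sketch of that dssvd point (signatures and import paths from memory; treat them as assumptions):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.decompositions._
import org.apache.mahout.sparkbindings._

implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "dssvd-keys")

val drmA = drmParallelize(dense(
  (1.0, 2.0, 3.0),
  (3.0, 4.0, 5.0),
  (5.0, 6.0, 7.0),
  (7.0, 8.0, 9.0)))     // DrmLike[Int], row keys 0..3

// U comes back keyed exactly like A (Ints here); NamedVector names, had the
// rows of A carried any, would not be propagated to U.
val (drmU, drmV, s) = dssvd(drmA, k = 2, p = 1, q = 0)
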
On Apr 2, 2014 1:08 PM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> Are the Spark efforts supporting all Mahout Vector types? Named, Property
> Vectors? It occurred to me that data frames in R are a related but more
> general solution. If all rows and columns of a DRM and their corresponding
> Vectors (row or column vectors) were to support arbitrary properties
> attached to them in such a way that they are preserved during transpose,
> Vector extraction, and any other operations that make sense, there would be
> a huge benefit for users.
>
> One of the constant problems with input to Mahout is translation of IDs.
> External to Mahout going in, Mahout to external coming out. Most of this
> would be unneeded if Mahout supported data frames, some would be avoided by
> supporting named or property vectors universally.
>
>