Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/12/05 13:48:08 UTC

Retrieving labels for indexes?

I'm trying to use Vectors to represent a vector of user preferences.
All is well since items are numeric and can be used as indexes into a
Vector -- almost. I have longs, and of course indexes are ints.

I could fold the long IDs into ints without too much worry about the
effects of collision. However, I still need to remember the original
item IDs for each index. I could do it with labels, but I can't
retrieve the label for an index (and the other mapping isn't
serialized anyway?).

So I guess I must separately store this mapping? Just making sure I'm
not missing something.
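
In case it helps to be concrete, "separately store this mapping" amounts to
something like the following sketch -- plain Java, with a made-up class name,
not anything that exists in Mahout today:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Hands out dense int indexes for long item IDs and remembers the reverse mapping. */
    final class ItemIndex {
      private final Map<Long, Integer> idToIndex = new HashMap<Long, Integer>();
      private final List<Long> indexToId = new ArrayList<Long>();

      /** Returns the index for this item ID, assigning the next free slot on first use. */
      int indexOf(long itemId) {
        Integer index = idToIndex.get(itemId);
        if (index == null) {
          index = indexToId.size();
          idToIndex.put(itemId, index);
          indexToId.add(itemId);
        }
        return index;
      }

      /** Recovers the original item ID for a vector index. */
      long idAt(int index) {
        return indexToId.get(index);
      }
    }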

Re: Retrieving labels for indexes?

Posted by Ted Dunning <te...@gmail.com>.
I think that we should go with a labeling layer for these sorts of
applications and not mess with the underlying matrix representation.
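
Roughly along these lines, say -- only a sketch, with a plain double[] standing
in for the real Vector and invented method names, but it shows labels living in
a layer above the int-indexed representation rather than inside it:

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of a labeling layer: the underlying storage stays int-indexed. */
    final class LabeledVector {
      private final double[] values;  // stand-in for the int-indexed Vector
      private final Map<String, Integer> labelToIndex = new HashMap<String, Integer>();
      private final Map<Integer, String> indexToLabel = new HashMap<Integer, String>();

      LabeledVector(int cardinality) {
        this.values = new double[cardinality];
      }

      void set(String label, int index, double value) {
        labelToIndex.put(label, index);
        indexToLabel.put(index, label);
        values[index] = value;
      }

      double get(String label) {
        return values[labelToIndex.get(label)];
      }

      /** The lookup Sean is missing today: index back to label. */
      String labelAt(int index) {
        return indexToLabel.get(index);
      }
    }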

On Tue, Dec 8, 2009 at 12:50 PM, Jake Mannix <ja...@gmail.com> wrote:

> For columns of a row-based matrix, I'm down with hashing or whatever.  For
> the rows on such matrices, inverting this is sometimes necessary (as Sean's
> case shows).  I'd hate to have an api with long row indexes and int column
> indices though, that would be unacceptable.
>
>  -jake
>
> On Tue, Dec 8, 2009 at 11:10 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Systems like Vowpal Wabbit already support billions (and more) features,
> > but
> > they do it with the hashing trick and deal with possible collisions by
> > multiple hashing.  They claim support for as many as 10^12 features.
> >
> > As long as it is possible to avoid the overhead, I would be +0.  If the
> > overhead applies to all tasks then I would be -1.
> >
> > Scalability is quite possible without this.
> >
> > On Tue, Dec 8, 2009 at 3:08 AM, Grant Ingersoll <gs...@apache.org>
> > wrote:
> >
> > > How hard would it be to transparently support both?  Could we have one
> > > implementation for "smaller" problems and one for larger?
> > >
> > > At any rate, +1 to making this be available for really large scale.
> > >
> > > -Grant
> > >
> > > On Dec 8, 2009, at 3:16 AM, Sean Owen wrote:
> > >
> > > > I'm sure it's not hard. It makes (sparse) vectors consume that much
> > > > more memory though.
> > > >
> > > > This change would certainly help my case, but I already have a bit of
> > > > a workaround: I hash longs into ints and store the reverse mapping.
> > > > There is possibility of collision but the consequence is small in the
> > > > context of collaborative filtering.
> > > >
> > > > I suppose if I'm the only use case that would benefit at the moment,
> > > > maybe not worth it, but if you can think of other reasons, let's
> > > > change.
> > > >
> > > > On Tue, Dec 8, 2009 at 5:48 AM, Jake Mannix <ja...@gmail.com> wrote:
> > > >> This brings up a point about our linear primitives: are 32bit integers
> > > >> big enough for our index range for vectors and matrices?  Especially
> > > >> for matrices, having billions of rows is completely possible, even if
> > > >> it is on the large side.
> > > >>
> > > >> If we want to be about "scalable" machine learning, we really don't
> > > >> want to seal ourselves in to "only" 2 billion x 2 billion matrices in
> > > >> the long run, do we?
> > > >>
> > > >> How hard would it be to promote our ints to longs?
> > > >>
> > > >>  -jake
> > > >>
> > > >> On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <sr...@gmail.com> wrote:
> > > >>
> > > >>> I'm trying to use Vectors to represent a vector of user preferences.
> > > >>> All is well since items are numeric and can be used as indexes into a
> > > >>> Vector -- almost. I have longs, and of course indexes are ints.
> > > >>>
> > > >>> I could fold the long IDs into ints without too much worry about the
> > > >>> effects of collision. However I still need to remember the original
> > > >>> item IDs for each index. I could do it with labels, but I can't
> > > >>> retrieve the label for an index (and the other mapping isn't
> > > >>> serialized anyway?).
> > > >>>
> > > >>> So I guess I must separately store this mapping? Just making sure I'm
> > > >>> not missing something.
> > > >>>
> > > >>
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://www.lucidimagination.com/
> > >
> > > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> > > Solr/Lucene:
> > > http://www.lucidimagination.com/search
> > >
> > >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Retrieving labels for indexes?

Posted by Jake Mannix <ja...@gmail.com>.
For columns of a row-based matrix, I'm down with hashing or whatever.  For
the rows on such matrices, inverting this is sometimes necessary (as Sean's
case shows).  I'd hate to have an API with long row indexes and int column
indices, though; that would be unacceptable.

  -jake

On Tue, Dec 8, 2009 at 11:10 AM, Ted Dunning <te...@gmail.com> wrote:

> Systems like Vowpal Wabbit already support billions (and more) features,
> but
> they do it with the hashing trick and deal with possible collisions by
> multiple hashing.  They claim support for as many as 10^12 features.
>
> As long as it is possible to avoid the overhead, I would be +0.  If the
> overhead applies to all tasks then I would be -1.
>
> Scalability is quite possible without this.
>
> On Tue, Dec 8, 2009 at 3:08 AM, Grant Ingersoll <gs...@apache.org>
> wrote:
>
> > How hard would it be to transparently support both?  Could we have one
> > implementation for "smaller" problems and one for larger?
> >
> > At any rate, +1 to making this be available for really large scale.
> >
> > -Grant
> >
> > On Dec 8, 2009, at 3:16 AM, Sean Owen wrote:
> >
> > > I'm sure it's not hard. It makes (sparse) vectors consume that much
> > > more memory though.
> > >
> > > This change would certainly help my case, but I already have a bit of
> > > a workaround: I hash longs into ints and store the reverse mapping.
> > > There is possibility of collision but the consequence is small in the
> > > context of collaborative filtering.
> > >
> > > I suppose if I'm the only use case that would benefit at the moment,
> > > maybe not worth it, but if you can think of other reasons, let's
> > > change.
> > >
> > > On Tue, Dec 8, 2009 at 5:48 AM, Jake Mannix <ja...@gmail.com> wrote:
> > >> This brings up a point about our linear primitives: are 32bit integers
> > >> big enough for our index range for vectors and matrices?  Especially for
> > >> matrices, having billions of rows is completely possible, even if it is
> > >> on the large side.
> > >>
> > >> If we want to be about "scalable" machine learning, we really don't want
> > >> to seal ourselves in to "only" 2 billion x 2 billion matrices in the
> > >> long run, do we?
> > >>
> > >> How hard would it be to promote our ints to longs?
> > >>
> > >>  -jake
> > >>
> > >> On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <sr...@gmail.com> wrote:
> > >>
> > >>> I'm trying to use Vectors to represent a vector of user preferences.
> > >>> All is well since items are numeric and can be used as indexes into a
> > >>> Vector -- almost. I have longs, and of course indexes are ints.
> > >>>
> > >>> I could fold the long IDs into ints without too much worry about the
> > >>> item IDs for each index. I could do it with labels, but I can't
> > >>> retrieve the label for an index (and the other mapping isn't
> > >>> serialized anyway?).
> > >>>
> > >>> So I guess I must separately store this mapping? Just making sure I'm
> > >>> not missing something.
> > >>>
> > >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> > Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Retrieving labels for indexes?

Posted by Ted Dunning <te...@gmail.com>.
Systems like Vowpal Wabbit already support billions (and more) features, but
they do it with the hashing trick and deal with possible collisions by
multiple hashing.  They claim support for as many as 10^12 features.
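
To sketch the idea (this is only an illustration of hashing features into a
fixed-size array and spreading each one over several hashes; it is not how
Vowpal Wabbit actually implements it):

    /**
     * Toy hashing trick with multiple hashes: each feature maps to k slots in a
     * fixed-size weight array, so a single collision only perturbs 1/k of its weight.
     */
    final class HashedFeatureVector {
      private final double[] weights;
      private final int numHashes;

      HashedFeatureVector(int size, int numHashes) {
        this.weights = new double[size];
        this.numHashes = numHashes;
      }

      void add(String feature, double value) {
        for (int h = 0; h < numHashes; h++) {
          weights[slot(feature, h)] += value / numHashes;
        }
      }

      double get(String feature) {
        double sum = 0.0;
        for (int h = 0; h < numHashes; h++) {
          sum += weights[slot(feature, h)];
        }
        return sum;
      }

      private int slot(String feature, int seed) {
        int h = (feature + '#' + seed).hashCode();
        return (h & Integer.MAX_VALUE) % weights.length;  // non-negative index
      }
    }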

As long as it is possible to avoid the overhead, I would be +0.  If the
overhead applies to all tasks then I would be -1.

Scalability is quite possible without this.

On Tue, Dec 8, 2009 at 3:08 AM, Grant Ingersoll <gs...@apache.org> wrote:

> How hard would it be to transparently support both?  Could we have one
> implementation for "smaller" problems and one for larger?
>
> At any rate, +1 to making this be available for really large scale.
>
> -Grant
>
> On Dec 8, 2009, at 3:16 AM, Sean Owen wrote:
>
> > I'm sure it's not hard. It makes (sparse) vectors consume that much
> > more memory though.
> >
> > This change would certainly help my case, but I already have a bit of
> > a workaround: I hash longs into ints and store the reverse mapping.
> > There is possibility of collision but the consequence is small in the
> > context of collaborative filtering.
> >
> > I suppose if I'm the only use case that would benefit at the moment,
> > maybe not worth it, but if you can think of other reasons, let's
> > change.
> >
> > On Tue, Dec 8, 2009 at 5:48 AM, Jake Mannix <ja...@gmail.com> wrote:
> >> This brings up a point about our linear primitives: are 32bit integers
> >> big enough for our index range for vectors and matrices?  Especially for
> >> matrices, having billions of rows is completely possible, even if it is
> >> on the large side.
> >>
> >> If we want to be about "scalable" machine learning, we really don't want
> >> to seal ourselves in to "only" 2 billion x 2 billion matrices in the long
> >> run, do we?
> >>
> >> How hard would it be to promote our ints to longs?
> >>
> >>  -jake
> >>
> >> On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <sr...@gmail.com> wrote:
> >>
> >>> I'm trying to use Vectors to represent a vector of user preferences.
> >>> All is well since items are numeric and can be used as indexes into a
> >>> Vector -- almost. I have longs, and of course indexes are ints.
> >>>
> >>> I could fold the long IDs into ints without too much worry about the
> >>> effects of collision. However I still need to remember the original
> >>> item IDs for each index. I could do it with labels, but I can't
> >>> retrieve the label for an index (and the other mapping isn't
> >>> serialized anyway?).
> >>>
> >>> So I guess I must separately store this mapping? Just making sure I'm
> >>> not missing something.
> >>>
> >>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: Retrieving labels for indexes?

Posted by Sean Owen <sr...@gmail.com>.
You'd need a copy of the class with longs in place of ints. Seems a little
yucky. I suppose I'll wait to see whether there's any other demand for
it, then proceed if so.
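
For the record, the long-keyed copy wouldn't be much code on the surface --
something like this sketch, where the class is hypothetical, a boxed HashMap is
nothing like what we'd actually ship, and the method names merely mirror the
int-indexed Vector's:

    import java.util.HashMap;
    import java.util.Map;

    /** Rough sketch of a long-indexed sparse vector, if we cloned the class. */
    final class LongSparseVector {
      private final long cardinality;
      private final Map<Long, Double> entries = new HashMap<Long, Double>();

      LongSparseVector(long cardinality) {
        this.cardinality = cardinality;
      }

      void setQuick(long index, double value) {
        entries.put(index, value);
      }

      double getQuick(long index) {
        Double value = entries.get(index);
        return value == null ? 0.0 : value;
      }

      long size() {
        return cardinality;
      }
    }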

On Tue, Dec 8, 2009 at 11:08 AM, Grant Ingersoll <gs...@apache.org> wrote:
> How hard would it be to transparently support both?  Could we have one implementation for "smaller" problems and one for larger?
>
> At any rate, +1 to making this be available for really large scale.
>
> -Grant
>
> On Dec 8, 2009, at 3:16 AM, Sean Owen wrote:
>
>> I'm sure it's not hard. It makes (sparse) vectors consume that much
>> more memory though.
>>
>> This change would certainly help my case, but I already have a bit of
>> a workaround: I hash longs into ints and store the reverse mapping.
>> There is possibility of collision but the consequence is small in the
>> context of collaborative filtering.
>>
>> I suppose if I'm the only use case that would benefit at the moment,
>> maybe not worth it, but if you can think of other reasons, let's
>> change.
>>
>> On Tue, Dec 8, 2009 at 5:48 AM, Jake Mannix <ja...@gmail.com> wrote:
>>> This brings up a point about our linear primitives: are 32bit integers big
>>> enough for our index range for vectors and matrices?  Especially for
>>> matrices,
>>> having billions of rows is completely possible, even if it is on the large
>>> side.
>>>
>>> If we want to be about "scalable" machine learning, we really don't want to
>>> seal ourselves in to "only" 2 billion x 2 billion matrices in the long run,
>>> do we?
>>>
>>> How hard would it be to promote our ints to longs?
>>>
>>>  -jake
>>>
>>> On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> I'm trying to use Vectors to represent a vector of user preferences.
>>>> All is well since items are numeric and can be used as indexes into a
>>>> Vector -- almost. I have longs, and of course indexes are ints.
>>>>
>>>> I could fold the long IDs into ints without too much worry about the
>>>> effects of collision. However I still need to remember the original
>>>> item IDs for each index. I could do it with labels, but I can't
>>>> retrieve the label for an index (and the other mapping isn't
>>>> serialized anyway?).
>>>>
>>>> So I guess I must separately store this mapping? Just making sure I'm
>>>> not missing something.
>>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Retrieving labels for indexes?

Posted by Grant Ingersoll <gs...@apache.org>.
How hard would it be to transparently support both?  Could we have one implementation for "smaller" problems and one for larger?
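
Something like a shared read/write interface, perhaps -- hypothetical names,
just to show the shape of "transparently support both":

    /**
     * Sketch: callers address entries by long; an int-backed implementation
     * serves "smaller" problems, and a long-backed one (not shown) would serve
     * the really large ones behind the same interface.
     */
    interface BigVector {
      double get(long index);
      void set(long index, double value);
      long size();
    }

    final class IntBackedVector implements BigVector {
      private final double[] values;  // dense int-indexed storage for the small case

      IntBackedVector(int cardinality) {
        this.values = new double[cardinality];
      }

      public double get(long index) {
        return values[(int) index];
      }

      public void set(long index, double value) {
        values[(int) index] = value;
      }

      public long size() {
        return values.length;
      }
    }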

At any rate, +1 to making this available for really large scale.

-Grant

On Dec 8, 2009, at 3:16 AM, Sean Owen wrote:

> I'm sure it's not hard. It makes (sparse) vectors consume that much
> more memory though.
> 
> This change would certainly help my case, but I already have a bit of
> a workaround: I hash longs into ints and store the reverse mapping.
> There is possibility of collision but the consequence is small in the
> context of collaborative filtering.
> 
> I suppose if I'm the only use case that would benefit at the moment,
> maybe not worth it, but if you can think of other reasons, let's
> change.
> 
> On Tue, Dec 8, 2009 at 5:48 AM, Jake Mannix <ja...@gmail.com> wrote:
>> This brings up a point about our linear primitives: are 32bit integers big
>> enough for our index range for vectors and matrices?  Especially for
>> matrices,
>> having billions of rows is completely possible, even if it is on the large
>> side.
>> 
>> If we want to be about "scalable" machine learning, we really don't want to
>> seal ourselves in to "only" 2 billion x 2 billion matrices in the long run,
>> do we?
>> 
>> How hard would it be to promote our ints to longs?
>> 
>>  -jake
>> 
>> On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <sr...@gmail.com> wrote:
>> 
>>> I'm trying to use Vectors to represent a vector of user preferences.
>>> All is well since items are numeric and can be used as indexes into a
>>> Vector -- almost. I have longs, and of course indexes are ints.
>>> 
>>> I could fold the long IDs into ints without too much worry about the
>>> effects of collision. However I still need to remember the original
>>> item IDs for each index. I could do it with labels, but I can't
>>> retrieve the label for an index (and the other mapping isn't
>>> serialized anyway?).
>>> 
>>> So I guess I must separately store this mapping? Just making sure I'm
>>> not missing something.
>>> 
>> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Retrieving labels for indexes?

Posted by Sean Owen <sr...@gmail.com>.
I'm sure it's not hard. It makes (sparse) vectors consume that much
more memory though.

This change would certainly help my case, but I already have a bit of
a workaround: I hash longs into ints and store the reverse mapping.
There is a possibility of collision, but the consequence is small in the
context of collaborative filtering.

I suppose if mine is the only use case that would benefit at the moment,
it may not be worth it, but if you can think of other reasons, let's
change it.
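
Concretely, the workaround is on the order of this sketch; the particular hash
and the last-writer-wins collision handling are just one arbitrary choice:

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Fold a long ID into an int index by hashing, and keep a reverse map so the
     * original ID can be recovered. Colliding IDs simply share an index, which in
     * a collaborative filtering setting only blends two items slightly.
     */
    final class LongToIntFolding {
      private final Map<Integer, Long> indexToId = new HashMap<Integer, Long>();

      int indexFor(long itemId) {
        // XOR-fold the high and low 32 bits into a non-negative int.
        int index = (int) (itemId ^ (itemId >>> 32)) & Integer.MAX_VALUE;
        indexToId.put(index, itemId);  // last writer wins on collision
        return index;
      }

      Long idFor(int index) {
        return indexToId.get(index);
      }
    }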

On Tue, Dec 8, 2009 at 5:48 AM, Jake Mannix <ja...@gmail.com> wrote:
> This brings up a point about our linear primitives: are 32bit integers big
> enough for our index range for vectors and matrices?  Especially for
> matrices,
> having billions of rows is completely possible, even if it is on the large
> side.
>
> If we want to be about "scalable" machine learning, we really don't want to
> seal ourselves in to "only" 2 billion x 2 billion matrices in the long run,
> do we?
>
> How hard would it be to promote our ints to longs?
>
>  -jake
>
> On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> I'm trying to use Vectors to represent a vector of user preferences.
>> All is well since items are numeric and can be used as indexes into a
>> Vector -- almost. I have longs, and of course indexes are ints.
>>
>> I could fold the long IDs into ints without too much worry about the
>> effects of collision. However I still need to remember the original
>> item IDs for each index. I could do it with labels, but I can't
>> retrieve the label for an index (and the other mapping isn't
>> serialized anyway?).
>>
>> So I guess I must separately store this mapping? Just making sure I'm
>> not missing something.
>>
>

Re: Retrieving labels for indexes?

Posted by Jake Mannix <ja...@gmail.com>.
This brings up a point about our linear primitives: are 32-bit integers big
enough for our index range for vectors and matrices?  Especially for matrices,
having billions of rows is completely possible, even if it is on the large
side.

If we want to be about "scalable" machine learning, we really don't want to
seal ourselves into "only" 2 billion x 2 billion matrices in the long run,
do we?

How hard would it be to promote our ints to longs?

  -jake

On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <sr...@gmail.com> wrote:

> I'm trying to use Vectors to represent a vector of user preferences.
> All is well since items are numeric and can be used as indexes into a
> Vector -- almost. I have longs, and of course indexes are ints.
>
> I could fold the long IDs into ints without too much worry about the
> effects of collision. However I still need to remember the original
> item IDs for each index. I could do it with labels, but I can't
> retrieve the label for an index (and the other mapping isn't
> serialized anyway?).
>
> So I guess I must separately store this mapping? Just making sure I'm
> not missing something.
>