
Posted to user@mahout.apache.org by Mike Khristo <mi...@gmail.com> on 2011/06/01 04:50:40 UTC

Why do userid & itemid have to be long?

Rather, how can I use string-based userid/itemid's without having to deal
with the slowness associated with mapping them to a long?

In the MongoDataModel, for example, significant time/overhead goes into
converting the unique id's to long...  I'm still getting my head wrapped
around Mahout, but this seems like a significant limitation. I have to
assume there's some logic behind the decision to restrict them to long, but
I didn't find anything about it in Mahout in Action or the list.

Thanks.

Re: Why do userid & itemid have to be long?

Posted by Lance Norskog <go...@gmail.com>.

UserID and ItemID are usually domain-level keys, not generated by the
DB. With some of the movie databases, you get tables of
"user/item/pref/time", "item/moviename/genre", and maybe
"user/geocode".

Lance

On Tue, May 31, 2011 at 9:51 PM, Mike Khristo <mi...@gmail.com> wrote:
> Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira (
> https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set with
> ~300k rows like:
>
> "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
>
> It's slowly doing the translations:
> INFO: [+++][MONGO-MAP] Adding Translation    Item ID:
> 4d57d54434ac9fd3570005a2 long_value: 145367
>
> It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
> 8G ram, 4 virtual cores
>
> With a test data set of 3M preferences, that would take >5 days, just for
> the translation.
>
> Open to ideas/suggestions/"a-ha"-moments. Thanks!
>
>
>
>
> On Tue, May 31, 2011 at 9:15 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> It makes the internals much cleaner to not repeat this conversion.
>>
>> But how is it that this is taking a long time?  String -> lookup should not
>> be much longer than an array access, especially if you use the Mahout
>> collections or one of the dictionary types.
>>
>> On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <mi...@gmail.com>
>> wrote:
>>
>> > Rather, how can I use string-based userid/itemid's without having to
>> deal
>> > with the slowness associated with mapping them to a long?
>> >
>> > In the MongoDataModel, for example, significant time/overhead goes into
>> > converting the unique id's to long...  I'm still getting my head wrapped
>> > around mahout, but this seems like a significant limitation. I have to
>> > assume there's some logic behind the decision to restrict them to long,
>> but
>> > i didn't find anything about it in Mahout in Action or the list.
>> >
>> > Thanks.
>> >
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Why do userid & itemid have to be long?

Posted by Ted Dunning <te...@gmail.com>.

Preallocation matters almost not at all for Java HashMaps.

This is just slow.  Full stop.

On Tue, May 31, 2011 at 10:13 PM, Lance Norskog <go...@gmail.com> wrote:

> Also, you can tell HashMap you'll be adding lots of entries when it
> starts. This might help it run faster. But, yes, this is bizarrely
> slow.
>

Re: Why do userid & itemid have to be long?

Posted by Lance Norskog <go...@gmail.com>.

Could it be doing lots of garbage collection? Have you monitored the
JVMs while this takes so long?
Also, you can tell HashMap you'll be adding lots of entries when it
starts. This might help it run faster. But, yes, this is bizarrely
slow.
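
For reference, pre-sizing a HashMap looks roughly like the sketch below. The class and method names (PresizedMaps, capacityFor, newIdMap) are made up for illustration and are not part of Mahout or the patch; the only library call is the standard HashMap(int initialCapacity) constructor. The capacity is chosen so the expected number of entries stays under the default 0.75 load factor. As noted elsewhere in this thread, this is unlikely to explain a rate of a few translations per second.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): builds a HashMap sized up front. */
final class PresizedMaps {

    /** Initial capacity that keeps `expected` entries below the default 0.75 load factor. */
    static int capacityFor(int expected) {
        long needed = (long) Math.ceil(expected / 0.75) + 1;
        return (int) Math.min(Integer.MAX_VALUE, needed);
    }

    /** Creates an empty string-to-long map that should not need to resize for `expected` entries. */
    static Map<String, Long> newIdMap(int expected) {
        return new HashMap<>(capacityFor(expected));
    }
}
```

Usage would be something like PresizedMaps.newIdMap(500000) before loading the ID file.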

On Tue, May 31, 2011 at 10:08 PM, Chris Schilling
<ch...@thecleversense.com> wrote:
> I have a test set of 6M preferences (500k users, 500k items).  I recently
> switched my infrastructure to use Long sequential ids for users and items.
> Before this we were using Strings.  I was able to read in a map file for
> userIds and itemIds into a Java HashMap.  Conversions took negligible amount
> of time.  This sounds insane for only 5M prefs.
>
>
>
> On Tue, May 31, 2011 at 9:51 PM, Mike Khristo <mi...@gmail.com> wrote:
>
>> Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira (
>> https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set
>> with
>> ~300k rows like:
>>
>> "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
>>
>> It's slowly doing the translations:
>> INFO: [+++][MONGO-MAP] Adding Translation    Item ID:
>> 4d57d54434ac9fd3570005a2 long_value: 145367
>>
>> It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
>> 8G ram, 4 virtual cores
>>
>> With a test data set of 3M preferences, that would take >5 days, just for
>> the translation.
>>
>> Open to ideas/suggestions/"a-ha"-moments. Thanks!
>>
>>
>>
>>
>> On Tue, May 31, 2011 at 9:15 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> > It makes the internals much cleaner to not repeat this conversion.
>> >
>> > But how is it that this is taking a long time?  String -> lookup should
>> not
>> > be much longer than an array access, especially if you use the Mahout
>> > collections or one of the dictionary types.
>> >
>> > On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <mi...@gmail.com>
>> > wrote:
>> >
>> > > Rather, how can I use string-based userid/itemid's without having to
>> > deal
>> > > with the slowness associated with mapping them to a long?
>> > >
>> > > In the MongoDataModel, for example, significant time/overhead goes into
>> > > converting the unique id's to long...  I'm still getting my head
>> wrapped
>> > > around mahout, but this seems like a significant limitation. I have to
>> > > assume there's some logic behind the decision to restrict them to long,
>> > but
>> > > i didn't find anything about it in Mahout in Action or the list.
>> > >
>> > > Thanks.
>> > >
>> >
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Why do userid & itemid have to be long?

Posted by Chris Schilling <ch...@thecleversense.com>.

I have a test set of 6M preferences (500k users, 500k items).  I recently
switched my infrastructure to use Long sequential ids for users and items.
Before this we were using Strings.  I was able to read in a map file for
userIds and itemIds into a Java HashMap.  Conversions took a negligible
amount of time.  This sounds insane for only 5M prefs.



On Tue, May 31, 2011 at 9:51 PM, Mike Khristo <mi...@gmail.com> wrote:

> Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira (
> https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set
> with
> ~300k rows like:
>
> "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
>
> It's slowly doing the translations:
> INFO: [+++][MONGO-MAP] Adding Translation    Item ID:
> 4d57d54434ac9fd3570005a2 long_value: 145367
>
> It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
> 8G ram, 4 virtual cores
>
> With a test data set of 3M preferences, that would take >5 days, just for
> the translation.
>
> Open to ideas/suggestions/"a-ha"-moments. Thanks!
>
>
>
>
> On Tue, May 31, 2011 at 9:15 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > It makes the internals much cleaner to not repeat this conversion.
> >
> > But how is it that this is taking a long time?  String -> lookup should
> not
> > be much longer than an array access, especially if you use the Mahout
> > collections or one of the dictionary types.
> >
> > On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <mi...@gmail.com>
> > wrote:
> >
> > > Rather, how can I use string-based userid/itemid's without having to
> > deal
> > > with the slowness associated with mapping them to a long?
> > >
> > > In the MongoDataModel, for example, significant time/overhead goes into
> > > converting the unique id's to long...  I'm still getting my head
> wrapped
> > > around mahout, but this seems like a significant limitation. I have to
> > > assume there's some logic behind the decision to restrict them to long,
> > but
> > > i didn't find anything about it in Mahout in Action or the list.
> > >
> > > Thanks.
> > >
> >
>

Re: Why do userid & itemid have to be long?

Posted by Ted Dunning <te...@gmail.com>.

That's better, but still pretty slow.

On Tue, May 31, 2011 at 11:16 PM, Mike Khristo <mi...@gmail.com> wrote:

> I haven't modified the patch, but yes, it appears to be storing the
> translations into a collection it creates (mongo_data_model_map):
> https://issues.apache.org/jira/secure/attachment/12479895/MAHOUT-705.patch
>
> The patch doesn't put any indexes on the mongoMapCollection.
>
> Just added the following:
> db.mongo_data_model_map.ensureIndex({element_id : 1})
> db.mongo_data_model_map.ensureIndex({long_value : 1})
>
> Now it's doing about 50k translations per minute (as opposed to 30k per
> hour).
>
>
>
>
>
> On Tue, May 31, 2011 at 11:01 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Are you putting the translations into Mongo?
> >
> > On Tue, May 31, 2011 at 9:51 PM, Mike Khristo <mi...@gmail.com>
> > wrote:
> >
> > > Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira (
> > > https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set
> > > with
> > > ~300k rows like:
> > >
> > > "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
> > >
> > > It's slowly doing the translations:
> > > INFO: [+++][MONGO-MAP] Adding Translation    Item ID:
> > > 4d57d54434ac9fd3570005a2 long_value: 145367
> > >
> > > It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
> > > 8G ram, 4 virtual cores
> > >
> > > With a test data set of 3M preferences, that would take >5 days, just
> for
> > > the translation.
> > >
> > > Open to ideas/suggestions/"a-ha"-moments. Thanks!
> > >
> > >
> > >
> > >
> > > On Tue, May 31, 2011 at 9:15 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > It makes the internals much cleaner to not repeat this conversion.
> > > >
> > > > But how is it that this is taking a long time?  String -> lookup
> should
> > > not
> > > > be much longer than an array access, especially if you use the Mahout
> > > > collections or one of the dictionary types.
> > > >
> > > > On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <mikekhristo@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Rather, how can I use string-based userid/itemid's without having
> to
> > > > deal
> > > > > with the slowness associated with mapping them to a long?
> > > > >
> > > > > In the MongoDataModel, for example, significant time/overhead goes
> > into
> > > > > converting the unique id's to long...  I'm still getting my head
> > > wrapped
> > > > > around mahout, but this seems like a significant limitation. I have
> > to
> > > > > assume there's some logic behind the decision to restrict them to
> > long,
> > > > but
> > > > > i didn't find anything about it in Mahout in Action or the list.
> > > > >
> > > > > Thanks.
> > > > >
> > > >
> > >
> >
>

Re: Why do userid & itemid have to be long?

Posted by Mike Khristo <mi...@gmail.com>.

I haven't modified the patch, but yes, it appears to be storing the
translations into a collection it creates (mongo_data_model_map):
https://issues.apache.org/jira/secure/attachment/12479895/MAHOUT-705.patch

The patch doesn't put any indexes on the mongoMapCollection.

Just added the following:
db.mongo_data_model_map.ensureIndex({element_id : 1})
db.mongo_data_model_map.ensureIndex({long_value : 1})

Now it's doing about 50k translations per minute (as opposed to 30k per
hour).
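
To put the two rates side by side, here is a small arithmetic sketch using only the figures quoted in this thread (30k per hour before the indexes, 50k per minute after, 3M as the data set size). The class and method names are made up for illustration. It assumes a constant rate and treats 3M as an upper bound on the number of IDs to translate; the real count is the number of distinct user and item IDs, and the unindexed run was reported as getting slower over time.

```java
/** Hypothetical helper: compares the two translation rates reported in the thread. */
final class TranslationThroughput {

    /** Translations per second, given a count and the seconds it took. */
    static double perSecond(double count, double seconds) {
        return count / seconds;
    }

    /** Hours needed to translate `total` IDs at a constant rate. */
    static double hoursFor(long total, double ratePerSecond) {
        return total / ratePerSecond / 3600.0;
    }

    public static void main(String[] args) {
        double before = perSecond(30_000, 3600); // 30k per hour, no indexes
        double after = perSecond(50_000, 60);    // 50k per minute, with indexes
        System.out.printf("before: %.1f/sec, after: %.1f/sec, speedup: %.0fx%n",
                before, after, after / before);
        System.out.printf("3M IDs: %.0f hours before, %.0f hour(s) after%n",
                hoursFor(3_000_000L, before), hoursFor(3_000_000L, after));
    }
}
```

At a constant rate that is roughly 100 hours before the indexes versus about one hour after, a speedup of about 100x.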





On Tue, May 31, 2011 at 11:01 PM, Ted Dunning <te...@gmail.com> wrote:

> Are you putting the translations into Mongo?
>
> On Tue, May 31, 2011 at 9:51 PM, Mike Khristo <mi...@gmail.com>
> wrote:
>
> > Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira (
> > https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set
> > with
> > ~300k rows like:
> >
> > "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
> >
> > It's slowly doing the translations:
> > INFO: [+++][MONGO-MAP] Adding Translation    Item ID:
> > 4d57d54434ac9fd3570005a2 long_value: 145367
> >
> > It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
> > 8G ram, 4 virtual cores
> >
> > With a test data set of 3M preferences, that would take >5 days, just for
> > the translation.
> >
> > Open to ideas/suggestions/"a-ha"-moments. Thanks!
> >
> >
> >
> >
> > On Tue, May 31, 2011 at 9:15 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > It makes the internals much cleaner to not repeat this conversion.
> > >
> > > But how is it that this is taking a long time?  String -> lookup should
> > not
> > > be much longer than an array access, especially if you use the Mahout
> > > collections or one of the dictionary types.
> > >
> > > On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <mi...@gmail.com>
> > > wrote:
> > >
> > > > Rather, how can I use string-based userid/itemid's without having to
> > > deal
> > > > with the slowness associated with mapping them to a long?
> > > >
> > > > In the MongoDataModel, for example, significant time/overhead goes
> into
> > > > converting the unique id's to long...  I'm still getting my head
> > wrapped
> > > > around mahout, but this seems like a significant limitation. I have
> to
> > > > assume there's some logic behind the decision to restrict them to
> long,
> > > but
> > > > i didn't find anything about it in Mahout in Action or the list.
> > > >
> > > > Thanks.
> > > >
> > >
> >
>

Re: Why do userid & itemid have to be long?

Posted by Ted Dunning <te...@gmail.com>.

Are you putting the translations into Mongo?

On Tue, May 31, 2011 at 9:51 PM, Mike Khristo <mi...@gmail.com> wrote:

> Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira (
> https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set
> with
> ~300k rows like:
>
> "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
>
> It's slowly doing the translations:
> INFO: [+++][MONGO-MAP] Adding Translation    Item ID:
> 4d57d54434ac9fd3570005a2 long_value: 145367
>
> It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
> 8G ram, 4 virtual cores
>
> With a test data set of 3M preferences, that would take >5 days, just for
> the translation.
>
> Open to ideas/suggestions/"a-ha"-moments. Thanks!
>
>
>
>
> On Tue, May 31, 2011 at 9:15 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > It makes the internals much cleaner to not repeat this conversion.
> >
> > But how is it that this is taking a long time?  String -> lookup should
> not
> > be much longer than an array access, especially if you use the Mahout
> > collections or one of the dictionary types.
> >
> > On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <mi...@gmail.com>
> > wrote:
> >
> > > Rather, how can I use string-based userid/itemid's without having to
> > deal
> > > with the slowness associated with mapping them to a long?
> > >
> > > In the MongoDataModel, for example, significant time/overhead goes into
> > > converting the unique id's to long...  I'm still getting my head
> wrapped
> > > around mahout, but this seems like a significant limitation. I have to
> > > assume there's some logic behind the decision to restrict them to long,
> > but
> > > i didn't find anything about it in Mahout in Action or the list.
> > >
> > > Thanks.
> > >
> >
>

Re: Why do userid & itemid have to be long?

Posted by Mike Khristo <mi...@gmail.com>.

Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira (
https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set with
~300k rows like:

"4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118

It's slowly doing the translations:
INFO: [+++][MONGO-MAP] Adding Translation    Item ID:
4d57d54434ac9fd3570005a2 long_value: 145367

It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
8G ram, 4 virtual cores

With a test data set of 3M preferences, that would take >5 days, just for
the translation.

Open to ideas/suggestions/"a-ha"-moments. Thanks!




On Tue, May 31, 2011 at 9:15 PM, Ted Dunning <te...@gmail.com> wrote:

> It makes the internals much cleaner to not repeat this conversion.
>
> But how is it that this is taking a long time?  String -> lookup should not
> be much longer than an array access, especially if you use the Mahout
> collections or one of the dictionary types.
>
> On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <mi...@gmail.com>
> wrote:
>
> > Rather, how can I use string-based userid/itemid's without having to
> deal
> > with the slowness associated with mapping them to a long?
> >
> > In the MongoDataModel, for example, significant time/overhead goes into
> > converting the unique id's to long...  I'm still getting my head wrapped
> > around mahout, but this seems like a significant limitation. I have to
> > assume there's some logic behind the decision to restrict them to long,
> but
> > i didn't find anything about it in Mahout in Action or the list.
> >
> > Thanks.
> >
>

Re: Why do userid & itemid have to be long?

Posted by Ted Dunning <te...@gmail.com>.

It makes the internals much cleaner to not repeat this conversion.

But how is it that this is taking a long time?  String -> lookup should not
be much longer than an array access, especially if you use the Mahout
collections or one of the dictionary types.
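
As a rough illustration of the kind of in-memory lookup meant here, the sketch below assigns sequential long IDs to string keys with a plain java.util.HashMap and keeps a list for the reverse direction. StringIdDictionary and its methods are hypothetical names, not Mahout classes, and the sketch assumes all IDs fit in memory and number fewer than Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): maps string keys to sequential longs in memory. */
final class StringIdDictionary {
    private final Map<String, Long> toLong = new HashMap<>();
    private final List<String> toKey = new ArrayList<>();

    /** Returns the long already assigned to this key, or assigns the next sequential one. */
    long idFor(String key) {
        Long existing = toLong.get(key);
        if (existing != null) {
            return existing;
        }
        long next = toKey.size(); // IDs are 0, 1, 2, ... in order of first appearance
        toLong.put(key, next);
        toKey.add(key);
        return next;
    }

    /** Reverse lookup by list index; only valid for IDs previously returned by idFor. */
    String keyFor(long id) {
        return toKey.get((int) id);
    }

    /** Number of distinct keys seen so far. */
    int size() {
        return toKey.size();
    }
}
```

Each translation is then one hash lookup in one direction and one list access in the other, with no database round trip.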

On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <mi...@gmail.com> wrote:

> Rather, how can I use string-based userid/itemid's without having to deal
> with the slowness associated with mapping them to a long?
>
> In the MongoDataModel, for example, significant time/overhead goes into
> converting the unique id's to long...  I'm still getting my head wrapped
> around mahout, but this seems like a significant limitation. I have to
> assume there's some logic behind the decision to restrict them to long, but
> i didn't find anything about it in Mahout in Action or the list.
>
> Thanks.
>

Re: Why do userid & itemid have to be long?

Posted by Sean Owen <sr...@gmail.com>.

It is for performance -- it used to allow any Comparable type but the object
overhead slowed things down by 2-3x.
It looks like you are using integer values already in Mongo, am I reading
that right? Those look like 12-byte hex values. Is it a question of
reading/writing them as such, rather than treating them as strings in Mongo?
If you really have to convert such a thing to/from String, I bet that
writing your own simple encoder/decoder would run much faster.
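
A minimal sketch of such an encoder/decoder for the 24-character hex IDs shown in this thread follows. ObjectIdHex and its methods are hypothetical names, not part of Mahout or the Mongo driver. One caveat: 12 bytes is 96 bits, which does not fit in a single 64-bit long, so the sketch splits the value into a 32-bit high part and a 64-bit low part. Getting a single long per ID would still need either a mapping table or a lossy reduction that can collide.

```java
/** Hypothetical helper: converts 24-character hex IDs to and from two numeric parts. */
final class ObjectIdHex {

    /** Parses 24 hex characters into {high 32 bits, low 64 bits}. */
    static long[] decode(String hex) {
        if (hex.length() != 24) {
            throw new IllegalArgumentException("expected 24 hex characters");
        }
        long high = Long.parseLong(hex.substring(0, 8), 16);      // first 4 bytes
        long low = Long.parseUnsignedLong(hex.substring(8), 16);  // last 8 bytes, may be negative as a signed long
        return new long[] {high, low};
    }

    /** Formats the two parts back into 24 lowercase hex characters. */
    static String encode(long high, long low) {
        return String.format("%08x%016x", high, low);
    }
}
```

For example, decode("4cec0a2934ac9fbd2b040000") followed by encode on the two parts gives back the same string.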

On Wed, Jun 1, 2011 at 3:50 AM, Mike Khristo <mi...@gmail.com> wrote:

> Rather, how can I use string-based userid/itemid's without having to deal
> with the slowness associated with mapping them to a long?
>
> In the MongoDataModel, for example, significant time/overhead goes into
> converting the unique id's to long...  I'm still getting my head wrapped
> around mahout, but this seems like a significant limitation. I have to
> assume there's some logic behind the decision to restrict them to long, but
> i didn't find anything about it in Mahout in Action or the list.
>
> Thanks.
>