Posted to user@mahout.apache.org by Matt Mitchell <go...@gmail.com> on 2012/08/01 22:29:23 UTC

UUID based user IDs

Question about dealing with UUIDs as Mahout user IDs. I'm considering
ways to deal with these values:

1. use getLeastSignificantBits
2. re-map to a database auto-increment number (this would take a very
long time to do?)
3. customize mahout so that it accepts UUIDs as user IDs

Any feedback here? If I went with #3 (seems the safest), how would I do
this, and what are the consequences?

The user count is in the millions.

Thanks!

Re: UUID based user IDs

Posted by Manuel Blechschmidt <Ma...@gmx.de>.
Hi Matt,
when you create your preferences from your data (normally millions of them), you always have to convert the UUIDs to longs first.

The example does that on the fly each time the recommender process starts, and it keeps the mapping in memory.

When receiving recommendations, you have to convert the long IDs back to UUIDs:

...
    List<RecommendedItem> items = recommender.recommend(thing2long.toLongID(personName), 10);
    for (RecommendedItem item : items) {
        recommendations.add(thing2long.toStringID(item.getItemID()));
    }
...

I would recommend that you just clone the example and play around with it. You can also run the test cases in a debugger to see what is happening.

git clone git://github.com/ManuelB/facebook-recommender-demo.git
cd facebook-recommender-demo
mvn install
mvn embedded-glassfish:run 

The GitHub project contains an Eclipse configuration, so it should be easy to import.

/Manuel


-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B


Re: UUID based user IDs

Posted by Sean Owen <sr...@gmail.com>.
It operates on Strings. You can use it by calling UUID.toString(). It will
be more efficient, though, to clone it and make your own custom version
that doesn't have to convert to a String and hash that String every time.
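A minimal sketch of what such a custom version might look like, assuming an in-memory reverse mapping is acceptable. The class name UuidIdMigrator and its methods are hypothetical and not part of Mahout; the XOR of the two 64-bit halves is one simple choice of hash, not the one Mahout's own migrators use.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper, not a Mahout class: maps UUIDs to longs and back, in memory. */
final class UuidIdMigrator {
    // Reverse mapping from the hashed long back to the original UUID.
    private final Map<Long, UUID> longToUuid = new ConcurrentHashMap<>();

    /** Hashes the UUID straight from its two 64-bit halves; no String is built. */
    long toLongID(UUID id) {
        long hashed = id.getMostSignificantBits() ^ id.getLeastSignificantBits();
        // On a (rare) collision the first UUID seen for this long is kept.
        longToUuid.putIfAbsent(hashed, id);
        return hashed;
    }

    /** Returns the UUID stored for this long, or null if it was never mapped. */
    UUID toUuid(long longId) {
        return longToUuid.get(longId);
    }

    public static void main(String[] args) {
        UuidIdMigrator migrator = new UuidIdMigrator();
        UUID user = UUID.randomUUID();
        long userId = migrator.toLongID(user);
        System.out.println(user + " -> " + userId + " -> " + migrator.toUuid(userId));
    }
}
```

With millions of users the reverse map holds one entry per user, so it may be worth persisting it instead if memory is tight.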

On Thu, Aug 2, 2012 at 3:40 AM, Matt Mitchell <go...@gmail.com> wrote:

> Thanks Manuel, that's very helpful. So you're saying I can just use
> MemoryIDMigrator, even after my preferences have been created with UUID
> values? Or, should I create my preferences using the MemoryIDMigrator?
>
>

Re: UUID based user IDs

Posted by Matt Mitchell <go...@gmail.com>.
Thanks Manuel, that's very helpful. So you're saying I can just use
MemoryIDMigrator, even after my preferences have been created with UUID
values? Or, should I create my preferences using the MemoryIDMigrator?

- Matt



Re: UUID based user IDs

Posted by Manuel Blechschmidt <Ma...@gmx.de>.
Hello Matt,

On 01.08.2012, at 22:40, Matt Mitchell wrote:

> Thanks Sean! That all makes sense. Would you mind recommending a
> hashing function for this? Is there something in Mahout I could use?

The following class uses a String-to-long mapping based on a MemoryIDMigrator:

https://github.com/ManuelB/facebook-recommender-demo/blob/master/src/main/java/de/apaxo/bedcon/FacebookRecommender.java

Internally, Mahout uses part of the MD5 hash, which can, for example, be expressed directly in SQL:

cast(conv(substring(md5([column name]), 1, 16),16,10) as signed)
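The same idea can be sketched in plain Java: take the MD5 digest of the string and pack its first 8 bytes (the first 16 hex characters) into a long. This is an illustration of the approach described above, not a copy of Mahout's code; the class name Md5LongHash is hypothetical, and UTF-8 encoding is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a long ID from the first 8 bytes of an MD5 digest. */
final class Md5LongHash {
    static long toLong(String value) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        // Pack the first 8 digest bytes, most significant byte first, into a long.
        long hash = 0L;
        for (int i = 0; i < 8; i++) {
            hash = (hash << 8) | (digest[i] & 0xFFL);
        }
        return hash;
    }

    public static void main(String[] args) {
        String uuid = "123e4567-e89b-12d3-a456-426614174000";
        System.out.println(uuid + " -> " + toLong(uuid));
    }
}
```

Java's long is signed, so digests whose first byte is 0x80 or higher come out negative. Whether the SQL expression yields the same value in that range depends on how the database handles the signed cast, so it is worth checking a few IDs on both sides before relying on them matching.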

Javadoc can be found here:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/model/IDMigrator.html

/Manuel




Re: UUID based user IDs

Posted by Sean Owen <sr...@gmail.com>.
No, but I'd recommend XORing the top 64 bits with the bottom 64 bits,
something simple like that.
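As a rough sketch, the XOR suggestion could look like the following; the class name UuidFold is made up for illustration. Compared with getLeastSignificantBits alone, the fold uses all 128 bits. That matters for time-based (version 1) UUIDs, whose lower half holds the clock sequence and node and so tends to repeat on a single machine.

```java
import java.util.UUID;

/** Hypothetical helper: folds a 128-bit UUID into a 64-bit long. */
final class UuidFold {
    /** XORs the top 64 bits with the bottom 64 bits. */
    static long toLong(UUID id) {
        return id.getMostSignificantBits() ^ id.getLeastSignificantBits();
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(id + " -> " + toLong(id));
    }
}
```

The fold is one-way, so the reverse mapping from long back to UUID still has to be stored somewhere.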

On Wed, Aug 1, 2012 at 9:40 PM, Matt Mitchell <go...@gmail.com> wrote:

> Thanks Sean! That all makes sense. Would you mind recommending a
> hashing function for this? Is there something in Mahout I could use?
>
>

Re: UUID based user IDs

Posted by Matt Mitchell <go...@gmail.com>.
Thanks Sean! That all makes sense. Would you mind recommending a
hashing function for this? Is there something in Mahout I could use?

- Matt


Re: UUID based user IDs

Posted by Sean Owen <sr...@gmail.com>.
Yep, just hash to a long, from UUID or String or whatever. The occasional
collision does not cause a real problem. If you mix the tastes of two users
or items once in a billion times, the overall results will hardly be
different.
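To put a rough number on how rare collisions are, the usual birthday approximation can be evaluated for a given user count. The sketch below assumes the hash spreads IDs uniformly over all 64 bits, which is an idealization; the class name CollisionEstimate is hypothetical.

```java
/** Hypothetical helper: estimates the chance of any collision among n hashed IDs. */
final class CollisionEstimate {
    /**
     * Birthday approximation for a uniform 64-bit hash:
     * P(at least one collision) ~= 1 - exp(-n(n-1) / (2 * 2^64)).
     */
    static double probability(long n) {
        double pairs = (double) n * (n - 1) / 2.0;  // number of distinct ID pairs
        double space = Math.pow(2.0, 64);           // number of possible long values
        return -Math.expm1(-pairs / space);         // expm1 keeps precision for tiny values
    }

    public static void main(String[] args) {
        for (long n : new long[] {1_000_000L, 10_000_000L, 100_000_000L}) {
            System.out.printf("n = %,d -> P(collision) ~ %.3e%n", n, probability(n));
        }
    }
}
```

For user counts in the millions the estimate stays far below one percent, which is in line with the point that an occasional collision is not a practical concern.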

You have to maintain the reverse mapping of course. Look at the IDMigrator
class for a little help there.

You can rewrite to use UUID or String, but believe me, it will be an
immense amount of change and make things much slower. It used to work this
way for recommenders in about 2006 and the Object overhead and GC pressure
was by far the bottleneck. That's why it's all long now.
