Posted to user@mahout.apache.org by Andrew Musselman <an...@gmail.com> on 2019/04/18 23:45:27 UTC

Re: How to convert a unique text value to a unique long value for a large data set

Ramu, sorry for the belated response, but if you're still interested you may
want to try the new version of item similarity, which is described in
this article:
https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Best
Andrew

On Thu, Sep 20, 2018 at 5:10 AM Ramu Ramaiah <ra...@gmail.com> wrote:

> Hi,
> I am using Apache Mahout's
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> with the input options
>
> 1. --booleanData
> 2. --similarityClassname SIMILARITY_LOGLIKELIHOOD
>
> The loglikelihood similarity algorithm expects numeric input. However, my
> data is textual. One of the things I did was to write a trivial
> standalone Java program to convert each unique text value to a unique long
> value. It does the following:
>
> 1. Maintain a Map<String, Long> in which the key is the unique text value
> and the value is the unique long value.
> 2. Before inserting a key, look it up in the Map. If the key exists, do not
> create a new long value. If the key does not exist, increment the counter
> and insert the key with the new counter value.
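
The two steps quoted above can be sketched roughly as follows. This is only
an illustration under assumptions: the class name TextIdDictionary, its
methods, and the sample rows are hypothetical and not part of Mahout, and
the numbering simply starts at 0.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Mahout class): hands out one long per distinct string.
class TextIdDictionary {
    private final Map<String, Long> ids = new HashMap<>();
    private long nextId = 0L;

    // Returns the existing ID for a known text value, or assigns the next counter value.
    public long idFor(String text) {
        Long existing = ids.get(text);
        if (existing != null) {
            return existing;
        }
        long id = nextId++;
        ids.put(text, id);
        return id;
    }

    // Number of distinct text values seen so far.
    public int size() {
        return ids.size();
    }

    public static void main(String[] args) {
        // Separate dictionaries for users and items; made-up sample data.
        TextIdDictionary users = new TextIdDictionary();
        TextIdDictionary items = new TextIdDictionary();
        String[][] rows = {{"alice", "book-1"}, {"bob", "book-1"}, {"alice", "book-2"}};
        for (String[] row : rows) {
            // Emits userID,itemID lines with numeric IDs.
            System.out.println(users.idFor(row[0]) + "," + items.idFor(row[1]));
        }
    }
}
```

Every distinct string stays in memory, so the footprint grows linearly with
the number of unique values.
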
>
> However, for large data sets, this may have a limitation since the map size
> grows with the number of unique text values.
>
> There are a couple of alternatives:
>
> 1. Create a database table with a uniqueness constraint on the text value
> (a primary key). Query the table before inserting a new long value. I am
> guessing this may be slow.
> 2. Use a hashing algorithm. However, whatever hashing algorithm I
> choose, there is a possibility of collision, so there is no guarantee of
> a unique long value for a given unique text value.
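
For a sense of scale on the hashing option, here is a minimal sketch that
takes the first 8 bytes of an MD5 digest as the long value and estimates the
collision risk with the usual birthday approximation, p ~ n(n-1)/2 / 2^64.
The class HashedTextId and its method names are hypothetical, and the
estimate assumes the 64-bit values are uniformly distributed. Collisions
remain possible, as noted above; the sketch only quantifies how likely they
are.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not a Mahout class): stateless text-to-long mapping via hashing.
class HashedTextId {

    // First 8 bytes of the MD5 digest, read as a big-endian long (may be negative).
    public static long toLong(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Birthday approximation for n distinct inputs in a 64-bit space; valid while the result is small.
    public static double approxCollisionProbability(long n) {
        return (double) n * (n - 1) / 2.0 / Math.pow(2, 64);
    }

    public static void main(String[] args) {
        System.out.println(toLong("alice"));
        System.out.println(approxCollisionProbability(1_000_000L));
        System.out.println(approxCollisionProbability(1_000_000_000L));
    }
}
```

Under that approximation, a million distinct values give a negligible chance
of any collision, while a billion give roughly a 3% chance. No lookup table
is needed to assign IDs, but mapping results back to the original text still
requires storing the pairs somewhere.
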
>
> Are there any better ways to solve this for a large data set?
>
> Thanks,
> Ramu
>