Posted to user@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/09/06 03:14:07 UTC

Re: Huge matrix

Hi Reza,

Have you compared dimsum with a brute-force algorithm for similarity
computation, something like the following in Spark?

https://github.com/echen/scaldingale

I am adding cosine similarity computation, but I do want to compute all-pair
similarities...

Note that the data is sparse for me (the data that goes to matrix
factorization) so I don't think joining and group-by on (product,product)
will be a big issue for me...
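
(For reference, the join-and-group-by brute force being described looks
roughly like this in Spark; a minimal sketch, where ratings:
RDD[(String, (String, Double))] keyed by user is an assumed input shape, not
code from the thread:)

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Brute-force cosine similarity over (product, product) pairs.
def bruteForceCosine(ratings: RDD[(String, (String, Double))]): RDD[((String, String), Double)] = {
  // self-join on user: every co-rated (product, product) pair, deduplicated
  val dots = ratings.join(ratings)
    .filter { case (_, ((p1, _), (p2, _))) => p1 < p2 }
    .map { case (_, ((p1, r1), (p2, r2))) => ((p1, p2), r1 * r2) }
    .reduceByKey(_ + _)                      // dot product per product pair
  val norms = ratings
    .map { case (_, (p, r)) => (p, r * r) }
    .reduceByKey(_ + _)
    .mapValues(math.sqrt)                    // column norms
  // normalize: join the norms back in for each side of the pair
  dots.map { case ((p1, p2), d) => (p1, (p2, d)) }.join(norms)
    .map { case (p1, ((p2, d), n1)) => (p2, (p1, d, n1)) }.join(norms)
    .map { case (p2, ((p1, d, n1), n2)) => ((p1, p2), d / (n1 * n2)) }
}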

Does it make sense to add brute-force all-pair similarities as well,
alongside the dimsum-based similarity?

Thanks.
Deb






On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <re...@databricks.com> wrote:

> Hi Xiaoli,
>
> There is a PR currently in progress to allow this, via the sampling scheme
> described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>
> The PR is at https://github.com/apache/spark/pull/336 though it will need
> refactoring given the recent changes to the matrix interface in MLlib. You
> may implement the sampling scheme for your own app since it's not much code.
>
> Best,
> Reza
>
>
> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <li...@gmail.com>
> wrote:
>
>> Hi Andrew,
>>
>> Thanks for your suggestion. I have tried the method. I used 8 nodes and
>> every node has 8G memory. The program just stalled at one stage for
>> several hours without any further information. Maybe I need to find
>> a more efficient way.
>>
>>
>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <an...@andrewash.com> wrote:
>>
>>> The naive way would be to put all the users and their attributes into an
>>> RDD, then cartesian product that with itself.  Run the similarity score on
>>> every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)) and
>>> take the .top(k) for each user.
>>>
>>> I doubt that you'll be able to take this approach with the 1T pairs
>>> though, so it might be worth looking at the literature for recommender
>>> systems to see what else is out there.
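
(A minimal sketch of this naive approach; users: RDD[(Long, Vector)] and the
similarity function are placeholders, not code from the thread:)

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def naiveTopK(users: RDD[(Long, Vector)], k: Int,
              similarity: (Vector, Vector) => Double): RDD[(Long, Seq[(Double, Long)])] = {
  users.cartesian(users)                           // 1M x 1M => ~1T pairs
    .filter { case ((i, _), (j, _)) => i != j }
    .map { case ((i, vi), (j, vj)) => (i, (similarity(vi, vj), j)) }
    .groupByKey()
    .mapValues(_.toSeq.sortBy(-_._1).take(k))      // top-k per user
}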
>>>
>>>
>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <li...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am implementing an algorithm using Spark. I have one million users. I
>>>> need to compute the similarity between each pair of users using some
>>>> user attributes. For each user, I need to get the top k most similar
>>>> users. What is the best way to implement this?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>
>

Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
I am still a bit confused about whether numbers like these can be aggregated
as a Double:

iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j)))

It should be aggregated using components, something like List[iVal*jVal,
colMags(i), colMags(j)], with the division done once at the end.

I am not sure Algebird can aggregate deterministically over Double...
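
(A sketch of that component-wise formulation: keep the raw products through an
associative reduce and divide once at the end. The names sg, colMags, and
contribs are assumed to come from the dimsum map side; they are not defined in
the thread.)

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def toSimilarities(contribs: RDD[((Int, Int), Double)],  // raw iVal * jVal per pair
                   sg: Double,
                   colMags: Array[Double]): RDD[((Int, Int), Double)] = {
  contribs
    .reduceByKey(_ + _)  // plain summation: associative and commutative
    .map { case ((i, j), sumProd) =>
      ((i, j), sumProd / (math.min(sg, colMags(i)) * math.min(sg, colMags(j))))
    }
}
// Caveat: Double addition is still order-sensitive in the last few bits, so
// run-to-run results can differ slightly even with this formulation.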







On Thu, Sep 18, 2014 at 2:08 PM, Reza Zadeh <re...@databricks.com> wrote:

> Hi Deb,
> I am currently seeding the algorithm to be pseudo-random; this is an issue
> being addressed in the PR. If you pull the current version it will be
> deterministic, but potentially not pseudo-random. The PR will be updated
> today.
> Best,
> Reza
>
> On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das <de...@gmail.com>
> wrote:
>
>> Hi Reza,
>>
>> Have you tested if different runs of the algorithm produce different
>> similarities (basically if the algorithm is deterministic) ?
>>
>> This number does not look like a Monoid aggregation...iVal * jVal /
>> (math.min(sg, colMags(i)) * math.min(sg, colMags(j)))
>>
>> I am noticing some weird behavior as different runs are changing the
>> results...
>>
>> Also can columnMagnitudes produce non-deterministic results ?
>>
>> Thanks.
>>
>> Deb
>>
>> On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh <re...@databricks.com> wrote:
>>
>>> Hi Deb,
>>>
>>> I am not templating RowMatrix/CoordinateMatrix since that would be a big
>>> deviation from the PR. We can add jaccard and other similarity measures in
>>> later PRs.
>>>
>>> In the meantime, you can un-normalize the cosine similarities to get the
>>> dot product, and then compute the other similarity measures from the dot
>>> product.
>>>
>>> Best,
>>> Reza
>>>
>>>
>>> On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das <de...@gmail.com>
>>> wrote:
>>>
>>>> Hi Reza,
>>>>
>>>> In similarColumns, it seems that along with cosine similarity I also need
>>>> other numbers, such as intersection counts, for jaccard and other
>>>> measures...
>>>>
>>>> Right now I modified the code to generate jaccard, but I had to run it
>>>> twice due to the design of RowMatrix / CoordinateMatrix...I feel we should
>>>> modify RowMatrix and CoordinateMatrix to be templated on the value...
>>>>
>>>> Are you considering this in your design?
>>>>
>>>> Thanks.
>>>> Deb
>>>>
>>>>
>>>> On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <re...@databricks.com> wrote:
>>>>
>>>>> Better to do it in a PR of your own; it's not sufficiently related to
>>>>> dimsum.
>>>>>
>>>>> On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.das83@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Cool...can I add loadRowMatrix in your PR ?
>>>>>>
>>>>>> Thanks.
>>>>>> Deb
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <re...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Deb,
>>>>>>>
>>>>>>> Did you mean to message me instead of Xiangrui?
>>>>>>>
>>>>>>> For TS (tall-and-skinny) matrices, dimsum with PositiveInfinity and
>>>>>>> computeGramian have the same cost, so you can do either one. For dense
>>>>>>> matrices with, say, 1M columns this won't be computationally feasible and
>>>>>>> you'll want to start sampling with dimsum.
>>>>>>>
>>>>>>> It would be helpful to have a loadRowMatrix function, I would use it.
>>>>>>>
>>>>>>> Best,
>>>>>>> Reza
>>>>>>>
>>>>>>> On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <
>>>>>>> debasish.das83@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Xiangrui,
>>>>>>>>
>>>>>>>> For tall skinny matrices, if I can pass a similarityMeasure to
>>>>>>>> computeGramian, I could re-use the SVD's computeGramian for similarity
>>>>>>>> computation as well...
>>>>>>>>
>>>>>>>> Do you recommend using this approach for tall skinny matrices, or
>>>>>>>> should I just use dimsum's routines?
>>>>>>>>
>>>>>>>> Right now RowMatrix does not have a loadRowMatrix function like the
>>>>>>>> one available for LabeledPoint...should I add one? I want to export the
>>>>>>>> matrix out from my stable code and then test dimsum...
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> Deb
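
(A hypothetical loadRowMatrix along the lines Deb describes, mirroring the
loadLabeledPoints convention of one row per line; this is a sketch, not an
existing MLlib API:)

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def loadRowMatrix(sc: SparkContext, path: String): RowMatrix = {
  val rows = sc.textFile(path)
    .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
  new RowMatrix(rows)  // row count and dimension are computed on demand
}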
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <re...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I will add dice, overlap, and jaccard similarity in a future PR,
>>>>>>>>> probably still for 1.2
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <
>>>>>>>>> debasish.das83@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Awesome...Let me try it out...
>>>>>>>>>>
>>>>>>>>>> Any plans of putting other similarity measures in future (jaccard
>>>>>>>>>> is something that will be useful) ? I guess it makes sense to add some
>>>>>>>>>> similarity measures in mllib...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <re...@databricks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, you're right: calling dimsum with gamma as PositiveInfinity
>>>>>>>>>>> turns it into the usual brute-force algorithm for cosine similarity;
>>>>>>>>>>> there is no sampling. This is by design.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <
>>>>>>>>>>> debasish.das83@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I looked at the code: similarColumns(Double.PositiveInfinity) is
>>>>>>>>>>>> generating the brute force...
>>>>>>>>>>>>
>>>>>>>>>>>> Basically, will dimsum with gamma as PositiveInfinity produce the
>>>>>>>>>>>> exact same result as doing cartesian products of RDD[(product, vector)]
>>>>>>>>>>>> and computing similarities, or will there be some approximation?
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry I have not read your paper yet. Will read it over the
>>>>>>>>>>>> weekend.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <reza@databricks.com
>>>>>>>>>>>> > wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> For 60M x 10K brute force and dimsum thresholding should be
>>>>>>>>>>>>> fine.
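
(Back-of-envelope on why the column count is what matters here: with n columns
there are n(n-1)/2 candidate pairs, so

  10K columns  ->  ~5 x 10^7  pairs
  10M columns  ->  ~5 x 10^13 pairs

The 60M rows only stream through the computation; the shuffle is over column
pairs.)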
>>>>>>>>>>>>>
>>>>>>>>>>>>> For 60M x 10M, brute force probably won't work, depending on the
>>>>>>>>>>>>> cluster's power; dimsum thresholding should work with an appropriate
>>>>>>>>>>>>> threshold.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dimensionality reduction should help; how effective it is will
>>>>>>>>>>>>> depend on your application and domain. It's worth trying if the direct
>>>>>>>>>>>>> computation doesn't work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can also try running KMeans clustering (perhaps after
>>>>>>>>>>>>> dimensionality reduction) if your goal is to find batches of similar points
>>>>>>>>>>>>> instead of all pairs above a threshold.
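
(A sketch of the clustering route with MLlib's KMeans; vectors and the cluster
count are placeholders, not values from the thread:)

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def clusterThenCompare(vectors: RDD[Vector], k: Int = 1000) = {
  val model = KMeans.train(vectors, k, 20)   // 20 iterations, illustrative
  // group points by cluster; compare only within each (much smaller) group
  vectors.map(v => (model.predict(v), v)).groupByKey()
}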
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <
>>>>>>>>>>>>> debasish.das83@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also for tall and wide (rows ~60M, columns 10M), I am
>>>>>>>>>>>>>> considering running a matrix factorization to reduce the dimension to say
>>>>>>>>>>>>>> ~60M x 50 and then run all pair similarity...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Did you also try similar ideas and see positive results?
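
(A sketch of that two-step idea with MLlib's ALS; the rank and iteration
counts are illustrative, not recommendations from the thread:)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def factorThenCompare(ratings: RDD[Rating]) = {
  val model = ALS.train(ratings, 50, 10)   // rank 50, 10 iterations
  // 60M x 50 dense factors: small enough for dimsum or brute force
  model.userFeatures.map { case (id, arr) => (id, Vectors.dense(arr)) }
}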
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <
>>>>>>>>>>>>>> debasish.das83@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok...just to make sure: I have RowMatrix[SparseVector] where
>>>>>>>>>>>>>>> rows are ~60M and columns are ~10M, with around a billion data points...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have another version that's around 60M and ~ 10K...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I guess for the second one both all pair and dimsum will run
>>>>>>>>>>>>>>> fine...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But for tall and wide, what do you suggest ? can dimsum
>>>>>>>>>>>>>>> handle it ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I might need jaccard as well...can I plug that in the PR ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <
>>>>>>>>>>>>>>> reza@databricks.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You might want to wait until Wednesday, since the interface
>>>>>>>>>>>>>>>> will be changing in that PR before then (probably over the weekend), so
>>>>>>>>>>>>>>>> that you don't have to redo your code. Your call if you need it sooner
>>>>>>>>>>>>>>>> than that.
>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <
>>>>>>>>>>>>>>>> debasish.das83@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ohh cool....all-pairs brute force is also part of this PR
>>>>>>>>>>>>>>>>> ? Let me pull it in and test on our dataset...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <
>>>>>>>>>>>>>>>>> reza@databricks.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Deb,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via
>>>>>>>>>>>>>>>>>> dimsum in this PR:
>>>>>>>>>>>>>>>>>> https://github.com/apache/spark/pull/1778
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Reza

Re: Huge matrix

Posted by Reza Zadeh <re...@databricks.com>.
Hi Deb,
I am currently seeding the algorithm to be pseudo-random; this is an issue
being addressed in the PR. If you pull the current version it will be
deterministic, but potentially not pseudo-random. The PR will be updated
today.
Best,
Reza
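
(The usual shape of such a fix is a fixed seed mixed with the partition index,
so every partition replays the same pseudo-random stream on each run. A
sketch, not the PR's actual code; rows and the keep-probability p are
placeholders:)

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def deterministicSample[T: ClassTag](rows: RDD[T], p: Double, seed: Long = 42L): RDD[T] =
  rows.mapPartitionsWithIndex { (idx, iter) =>
    val rng = new scala.util.Random(seed ^ idx)   // same partition => same stream
    iter.filter(_ => rng.nextDouble() < p)
  }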


Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
Hi Reza,

Have you tested if different runs of the algorithm produce different
similarities (basically if the algorithm is deterministic) ?

This number does not look like a Monoid aggregation...iVal * jVal /
(math.min(sg, colMags(i)) * math.min(sg, colMags(j)))

I am noticing some weird behavior as different runs are changing the
results...

Also can columnMagnitudes produce non-deterministic results ?

Thanks.

Deb


Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
Yup that's what I did for now...

On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh <re...@databricks.com> wrote:

> Hi Deb,
>
> I am not templating RowMatrix/CoordinateMatrix since that would be a big
> deviation from the PR. We can add jaccard and other similarity measures in
> later PRs.
>
> In the meantime, you can un-normalize the cosine similarities to get the
> dot product, and then compute the other similarity measures from the dot
> product.
>
> Best,
> Reza

Re: Huge matrix

Posted by Reza Zadeh <re...@databricks.com>.
Hi Deb,

I am not templating RowMatrix/CoordinateMatrix since that would be a big
deviation from the PR. We can add jaccard and other similarity measures in
later PRs.

In the meantime, you can un-normalize the cosine similarities to get the
dot product, and then compute the other similarity measures from the dot
product.

Best,
Reza
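
(A sketch of that un-normalization: dot(ci, cj) = cosine(i, j) * ||ci|| * ||cj||.
The jaccard step additionally assumes 0/1 data, where ||ci||^2 is the number
of nonzeros in column i. cosines and colMags are assumed inputs, not names
from the PR:)

import org.apache.spark.rdd.RDD

def cosineToJaccard(cosines: RDD[((Int, Int), Double)],
                    colMags: Array[Double]): RDD[((Int, Int), Double)] = {
  cosines.map { case ((i, j), c) =>
    val dot  = c * colMags(i) * colMags(j)   // un-normalize to the dot product
    val nnzI = colMags(i) * colMags(i)       // |A| for binary columns
    val nnzJ = colMags(j) * colMags(j)       // |B|
    ((i, j), dot / (nnzI + nnzJ - dot))      // |A and B| / (|A| + |B| - |A and B|)
  }
}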


>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is a PR currently in progress to allow this, via the
>>>>>>>>>>>>>>>>> sampling scheme described in this paper:
>>>>>>>>>>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336
>>>>>>>>>>>>>>>>> though it will need refactoring given the recent changes to matrix
>>>>>>>>>>>>>>>>> interface in MLlib. You may implement the sampling scheme for your own app
>>>>>>>>>>>>>>>>> since it's much code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <
>>>>>>>>>>>>>>>>> lixiaolimail1@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I
>>>>>>>>>>>>>>>>>> used 8 nodes and every node has 8G memory. The program just stopped at a
>>>>>>>>>>>>>>>>>> stage for about several hours without any further information. Maybe I need
>>>>>>>>>>>>>>>>>> to find
>>>>>>>>>>>>>>>>>> out a more efficient way.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <
>>>>>>>>>>>>>>>>>> andrew@andrewash.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The naive way would be to put all the users and their
>>>>>>>>>>>>>>>>>>> attributes into an RDD, then cartesian product that with itself.  Run the
>>>>>>>>>>>>>>>>>>> similarity score on every pair (1M * 1M => 1T scores), map to (user,
>>>>>>>>>>>>>>>>>>> (score, otherUser)) and take the .top(k) for each user.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with
>>>>>>>>>>>>>>>>>>> the 1T pairs though, so it might be worth looking at the literature for
>>>>>>>>>>>>>>>>>>> recommender systems to see what else is out there.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <
>>>>>>>>>>>>>>>>>>> lixiaolimail1@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one
>>>>>>>>>>>>>>>>>>>> million users. I need to compute the similarity between each pair of users
>>>>>>>>>>>>>>>>>>>> using some user's attributes.  For each user, I need to get top k most
>>>>>>>>>>>>>>>>>>>> similar users. What is the best way to implement this?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
Hi Reza,

In similarColumns, it seems that along with cosine similarity I also need
other numbers such as the intersection size, jaccard, and other measures...

Right now I have modified the code to generate jaccard, but I had to run it
twice due to the design of RowMatrix / CoordinateMatrix...I feel we should
modify RowMatrix and CoordinateMatrix to be templated on the value type...

Are you considering this in your design?

Thanks.
Deb
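
For what it's worth, for 0/1 data both measures can be derived from the
same co-occurrence counts, so a single pass over the rows should be enough.
A minimal sketch in Spark (cosineAndJaccard is a hypothetical helper, not
an existing MLlib API; it assumes binary column data):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.SparseVector

    // colCounts(i) = rows where column i is non-zero;
    // co           = rows where columns i and j are both non-zero.
    // For 0/1 data:
    //   cosine(i, j)  = co / sqrt(colCounts(i) * colCounts(j))
    //   jaccard(i, j) = co / (colCounts(i) + colCounts(j) - co)
    def cosineAndJaccard(rows: RDD[SparseVector])
      : RDD[((Int, Int), (Double, Double))] = {
      val colCounts = rows
        .flatMap(_.indices.map(i => (i, 1L)))
        .reduceByKey(_ + _)
        .collectAsMap()
      val bc = rows.sparkContext.broadcast(colCounts)
      rows
        .flatMap { v =>
          // quadratic in the non-zeros per row; fine for sparse rows
          val idx = v.indices
          for (a <- idx; b <- idx if a < b) yield ((a, b), 1L)
        }
        .reduceByKey(_ + _)
        .map { case ((i, j), co) =>
          val ci = bc.value(i).toDouble
          val cj = bc.value(j).toDouble
          ((i, j), (co / math.sqrt(ci * cj), co / (ci + cj - co)))
        }
    }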

Re: Huge matrix

Posted by Reza Zadeh <re...@databricks.com>.
Better to do it in a PR of your own; it's not sufficiently related to dimsum.

Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
Cool...can I add loadRowMatrix in your PR ?

Thanks.
Deb

Re: Huge matrix

Posted by Reza Zadeh <re...@databricks.com>.
Hi Deb,

Did you mean to message me instead of Xiangrui?

For TS (tall-and-skinny) matrices, dimsum with gamma set to PositiveInfinity
and computeGramian have the same cost, so you can do either one. For dense
matrices with, say, 1M columns this won't be computationally feasible and
you'll want to start sampling with dimsum.

It would be helpful to have a loadRowMatrix function; I would use it.

Best,
Reza
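
Concretely, the Gramian G = A^T A already holds every pairwise dot product
of the columns, and the column norms are the square roots of its diagonal,
so cosine similarities fall out of computeGramianMatrix directly. A minimal
sketch (cosineFromGramian is a hypothetical helper; it assumes the column
count n is small enough for an n x n local matrix and that no column has
zero norm):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    def cosineFromGramian(mat: RowMatrix): Array[Array[Double]] = {
      val g = mat.computeGramianMatrix() // n x n local Gramian, n = numCols
      val n = g.numCols
      val a = g.toArray                  // column-major entries of G
      def entry(i: Int, j: Int): Double = a(i + j * n)
      val norms = Array.tabulate(n)(i => math.sqrt(entry(i, i)))
      Array.tabulate(n, n)((i, j) => entry(i, j) / (norms(i) * norms(j)))
    }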

Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
Hi Xiangrui,

For tall skinny matrices, if I can pass a similarityMeasure to
computeGramian, I could re-use the SVD's computeGramian for similarity
computation as well...

Do you recommend using this approach for tall skinny matrices, or just
using dimsum's routines?

Right now RowMatrix does not have a loadRowMatrix function like the one
available for LabeledPoint...should I add one? I want to export the matrix
out from my stable code and then test dimsum...

Thanks.
Deb
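
A minimal sketch of what such a helper could look like, by analogy with
MLUtils.loadLabeledPoints (loadRowMatrix is hypothetical, not an existing
MLlib API; it assumes one dense row per text line, space-separated):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    def loadRowMatrix(sc: SparkContext, path: String): RowMatrix = {
      val rows = sc.textFile(path).map { line =>
        Vectors.dense(line.trim.split("\\s+").map(_.toDouble))
      }
      new RowMatrix(rows)
    }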

Re: Huge matrix

Posted by Reza Zadeh <re...@databricks.com>.
I will add dice, overlap, and jaccard similarity in a future PR, probably
still for 1.2
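
For reference, the standard set-based definitions of these measures over
binary columns, written as a sketch (the eventual PR may normalize
differently):

    // ni, nj = non-zero counts of columns i and j; c = common non-zeros
    def jaccard(c: Long, ni: Long, nj: Long): Double = c.toDouble / (ni + nj - c)
    def dice(c: Long, ni: Long, nj: Long): Double    = 2.0 * c / (ni + nj)
    def overlap(c: Long, ni: Long, nj: Long): Double = c.toDouble / math.min(ni, nj)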

Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
Awesome... let me try it out...

Are there any plans to add other similarity measures in the future (Jaccard
is one that would be useful)? I guess it makes sense to add some similarity
measures to MLlib...
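
For reference, the Jaccard computation I have in mind over binary
(user, product) data is just a self-join on users. A rough, untested
sketch (the names here are made up for illustration):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Jaccard(p1, p2) = |intersection of user sets| / |union of user sets|
    // ratings: (user, product) pairs from binary/implicit data
    def jaccard(ratings: RDD[(Int, Int)]): RDD[((Int, Int), Double)] = {
      val counts = ratings.map(_.swap).countByKey()   // product -> |users|
      val bCounts = ratings.sparkContext.broadcast(counts)

      ratings.join(ratings)                           // (user, (p1, p2))
        .map { case (_, (p1, p2)) => ((p1, p2), 1L) }
        .filter { case ((p1, p2), _) => p1 < p2 }     // keep each pair once
        .reduceByKey(_ + _)                           // intersection sizes
        .map { case ((p1, p2), inter) =>
          val union = bCounts.value(p1) + bCounts.value(p2) - inter
          ((p1, p2), inter.toDouble / union)
        }
    }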


Re: Huge matrix

Posted by Reza Zadeh <re...@databricks.com>.
Yes, you're right: calling dimsum with gamma set to PositiveInfinity turns it
into the usual brute-force algorithm for cosine similarity; there is no
sampling. This is by design.
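
In code, against the PR's similarColumns(gamma) as it currently stands (a
sketch only; the name, signature, and return type may change before the PR
is merged):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // gamma = +infinity: no sampling, i.e. plain brute-force cosine similarity
    def exact(mat: RowMatrix) =
      mat.similarColumns(Double.PositiveInfinity)

    // finite gamma: DIMSUM sampling; larger gamma means more oversampling
    // and higher accuracy, at higher cost
    def sampled(mat: RowMatrix, gamma: Double) =
      mat.similarColumns(gamma)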


Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
I looked at the code: similarColumns(Double.PositiveInfinity) generates the
brute force...

Basically, will dimsum with gamma as PositiveInfinity produce the exact same
result as taking the cartesian product of RDD[(product, vector)] and
computing the similarities, or will there be some approximation?

Sorry, I have not read your paper yet. Will read it over the weekend.
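
The brute force I have in mind is literally the cartesian product, along
these lines (rough, untested sketch; assumes dense, non-zero vectors):

    import org.apache.spark.rdd.RDD

    def cosine(a: Array[Double], b: Array[Double]): Double = {
      var dot = 0.0; var na = 0.0; var nb = 0.0
      var i = 0
      while (i < a.length) {
        dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i)
        i += 1
      }
      dot / (math.sqrt(na) * math.sqrt(nb))
    }

    // products: RDD[(productId, featureVector)]
    def allPairs(products: RDD[(Int, Array[Double])]): RDD[((Int, Int), Double)] =
      products.cartesian(products)
        .filter { case ((i, _), (j, _)) => i < j }   // each pair once
        .map { case ((i, vi), (j, vj)) => ((i, j), cosine(vi, vj)) }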



Re: Huge matrix

Posted by Reza Zadeh <re...@databricks.com>.
For 60M x 10K, both brute force and dimsum thresholding should be fine.

For 60M x 10M, brute force probably won't work, depending on the cluster's
power; dimsum thresholding should work with an appropriate threshold.

Dimensionality reduction should help, and how effective it is will depend
on your application and domain; it's worth trying if the direct computation
doesn't work.

You can also try running KMeans clustering (perhaps after dimensionality
reduction) if your goal is to find batches of similar points instead of all
pairs above a threshold.
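
For the KMeans route, a minimal sketch, assuming the reduced vectors are
already available as an RDD[Array[Double]]:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.rdd.RDD

    // reduced: e.g. low-dimensional factors, one Array[Double] per point
    def clusterPoints(reduced: RDD[Array[Double]], k: Int) = {
      val data = reduced.map(arr => Vectors.dense(arr)).cache()
      val model = KMeans.train(data, k, 20)    // k clusters, 20 iterations
      // points landing in the same cluster are the "batches of similar points"
      data.map(v => (model.predict(v), v))
    }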




Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
Also, for tall and wide (rows ~60M, columns ~10M), I am considering running
a matrix factorization to reduce the dimension to, say, ~60M x 50 and then
running all-pair similarity...

Did you also try similar ideas and see positive results?
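
Concretely, something like this sketch, assuming implicit-feedback ALS fits
my data and that the ~60M rows are the ALS "user" side:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    def reduceDimension(ratings: RDD[Rating]): RDD[(Int, Array[Double])] = {
      // factorize the ~60M x 10M matrix at rank 50 (10 ALS iterations)
      val model = ALS.trainImplicit(ratings, 50, 10)
      model.userFeatures  // ~60M rows of 50-dim factors for all-pair similarity
    }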



Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
Ok... just to make sure: I have a RowMatrix[SparseVector] where rows are
~60M and columns are ~10M, with, say, a billion data points...

I have another version that's around 60M x ~10K...

I guess for the second one both all-pairs and dimsum will run fine...

But for tall and wide, what do you suggest? Can dimsum handle it?

I might need Jaccard as well... can I plug that into the PR?



Re: Huge matrix

Posted by Reza Zadeh <re...@databricks.com>.
You might want to wait until Wednesday, since the interface in that PR will
be changing before then, probably over the weekend, so that you don't have
to redo your code. Your call if you need it sooner.
Reza


Re: Huge matrix

Posted by Debasish Das <de...@gmail.com>.
Ohh cool... all-pairs brute force is also part of this PR? Let me pull it
in and test it on our dataset...

Thanks.
Deb


Re: Huge matrix

Posted by Reza Zadeh <re...@databricks.com>.
Hi Deb,

We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
https://github.com/apache/spark/pull/1778

Your question wasn't entirely clear - does this answer it?

Best,
Reza

