Posted to user@spark.apache.org by nguyen duc Tuan <ne...@gmail.com> on 2017/02/09 07:55:31 UTC

Practical configuration to run LSH in Spark 2.1.0

Hi everyone,
Since Spark 2.1.0 introduces LSH (
http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
we want to use LSH to find approximate nearest neighbors. Basically, we
have a dataset with about 7M rows. We want to use cosine distance to measure
the similarity between items, so we use *RandomSignProjectionLSH* (
https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead of
*BucketedRandomProjectionLSH*. I have tried to tune several settings such as
serialization, memory fraction, executor memory (~6G), number of executors
(~20), and memory overhead, but nothing works. I keep getting
"java.lang.OutOfMemoryError: Java heap space" while running. I know that
this implementation was done by engineers at Uber, but I don't know the
right configuration to run the algorithm at scale. Does it need very large
memory to run?

Any help would be appreciated.
Thanks
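
For reference, a minimal sketch of the Spark 2.1 LSH workflow the question
refers to, using the built-in BucketedRandomProjectionLSH. The
RandomSignProjectionLSH from the gist is assumed to expose the same
fit/approxSimilarityJoin interface; the toy data, threshold, and object name
below are illustrative only:

    import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    object LshJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("lsh-sketch").getOrCreate()
        import spark.implicits._

        // Toy stand-in for the 7M-row item dataset: (id, features).
        val items = Seq(
          (0L, Vectors.dense(1.0, 0.0, 1.0)),
          (1L, Vectors.dense(0.9, 0.1, 1.0)),
          (2L, Vectors.dense(0.0, 1.0, 0.0))
        ).toDF("id", "features")

        val lsh = new BucketedRandomProjectionLSH()
          .setInputCol("features")
          .setOutputCol("hashes")
          .setNumHashTables(2)
          .setBucketLength(2.0)

        val model = lsh.fit(items)

        // Approximate similarity self-join below the given distance threshold.
        // The sign-random-projection variant from the gist would be dropped in
        // here when cosine distance is wanted.
        val pairs = model.approxSimilarityJoin(items, items, 1.0)
        pairs.show(truncate = false)
      }
    }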

Re: Practical configuration to run LSH in Spark 2.1.0

Posted by nguyen duc Tuan <ne...@gmail.com>.
I do a self-join. I tried to cache the transformed dataset before joining,
but it didn't help either.
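
A minimal sketch of that pattern, reusing the `model` and `items` from the
pipeline sketch under the first message above (the column names and the 0.6
threshold are illustrative):

    // Cache the hashed dataset once, then self-join on it.
    val transformed = model.transform(items).cache()
    transformed.count() // materialize the cache before the join

    val pairs = model.approxSimilarityJoin(transformed, transformed, 0.6)
      // drop self-matches and mirrored duplicates
      .filter("datasetA.id < datasetB.id")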


Re: Practical configuration to run LSH in Spark 2.1.0

Posted by Nick Pentreath <ni...@gmail.com>.
And to be clear, are you doing a self-join for approx similarity? Or
joining to another dataset?




Re: Practical configuration to run LSH in Spark 2.1.0

Posted by nguyen duc Tuan <ne...@gmail.com>.
Hi Seth,
Here are the parameters that I used in my experiments (a spark-submit
rendering of these follows below):
- Number of executors: 16
- Executor memory: varied from 1G -> 2G -> 3G
- Number of cores per executor: 1 -> 2
- Driver memory: 1G -> 2G -> 3G
- Similarity threshold: 0.6
MinHash:
- Number of hash tables: 2
SignedRandomProjection:
- Number of hash tables: 2
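
As a rough sketch, the same settings expressed as a spark-submit invocation
(only the largest values tried are shown; the class and jar names are
placeholders):

    spark-submit \
      --class com.example.LshJob \
      --master yarn \
      --num-executors 16 \
      --executor-cores 2 \
      --executor-memory 3g \
      --driver-memory 3g \
      lsh-job.jar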


Re: Practical configuration to run LSH in Spark 2.1.0

Posted by Seth Hendrickson <se...@gmail.com>.
I'm looking into this a bit further, thanks for bringing it up! Right now
the LSH implementation only uses OR-amplification. The practical
consequence of this is that it will select too many candidates when doing
approximate near neighbor search and approximate similarity join. When we
add AND-amplification I think it will become significantly more usable. In
the meantime, I will also investigate scalability issues.
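
To make the OR-vs-AND distinction concrete, a small back-of-the-envelope
sketch (plain Scala, not Spark API): if a single hash collides for a pair
with probability p, OR-amplification over b tables admits the pair with
probability 1 - (1 - p)^b, while AND-ing r hashes within each table first
gives 1 - (1 - p^r)^b, which suppresses dissimilar pairs far more strongly:

    object AmplificationSketch {
      // OR only: candidate if the pair collides in at least one of b tables.
      def orOnly(p: Double, b: Int): Double =
        1.0 - math.pow(1.0 - p, b)

      // AND within a table (r hashes) combined with OR across b tables.
      def andOr(p: Double, r: Int, b: Int): Double =
        1.0 - math.pow(1.0 - math.pow(p, r), b)

      def main(args: Array[String]): Unit = {
        val similar = 0.8    // per-hash collision probability of a close pair
        val dissimilar = 0.3 // per-hash collision probability of a far pair
        println(f"OR only (b=2):   ${orOnly(similar, 2)}%.3f vs ${orOnly(dissimilar, 2)}%.3f")
        println(f"AND r=4, OR b=2: ${andOr(similar, 4, 2)}%.3f vs ${andOr(dissimilar, 4, 2)}%.3f")
      }
    }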

Can you please provide every parameter you used? It will be very helpful :)
For instance, the similarity threshold, the number of hash tables, the
bucket width, etc...

Thanks!


Re: Practical configuration to run LSH in Spark 2.1.0

Posted by Nick Pentreath <ni...@gmail.com>.
The original Uber authors provided this performance test result:
https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro

This was for MinHash only though, so it's not clear what the
scalability is like for the other metric types.

The SignRandomProjectionLSH is not yet in Spark master (see
https://issues.apache.org/jira/browse/SPARK-18082). It could be that there
are some implementation details that would make a difference here.

By the way, what is the join threshold you use in approx join?

Could you perhaps create a JIRA ticket with the details in order to track
this?



Re: Practical configuration to run LSH in Spark 2.1.0

Posted by nguyen duc Tuan <ne...@gmail.com>.
In the end, I switched back to the LSH implementation that I used before (
https://github.com/karlhigley/spark-neighbors ). I can run it on my dataset
now. If someone has any suggestions, please tell me.
Thanks.


Re: Practical configuration to run LSH in Spark 2.1.0

Posted by nguyen duc Tuan <ne...@gmail.com>.
Hi Timur,
1) Our data is already transformed to a dataset of Vector.
2) If I use RandomSignProjectionLSH, the job dies after I call
approxSimilarityJoin. I tried to use MinHash instead, but the job is still
slow. I don't think the problem is related to GC: the time spent in GC is
small compared with the time spent on computation. Here are some screenshots
of my job.
Thanks


Re: Practical configuration to run LSH in Spark 2.1.0

Posted by Timur Shenkao <ts...@timshenkao.su>.
Hello,

1) Are you sure that your data is "clean"? No unexpected missing values?
No strings in unusual encodings? No additional or missing columns?
2) How long does your job run? What about garbage collector parameters?
Have you checked what happens with jconsole / jvisualvm?
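
If it helps, GC logging can also be switched on at submit time; a minimal
sketch (the remaining spark-submit arguments are elided):

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      ...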

Sincerely yours, Timur


Re: Practical configuration to run LSH in Spark 2.1.0

Posted by nguyen duc Tuan <ne...@gmail.com>.
Hi Nick,
Because we use *RandomSignProjectionLSH*, the only LSH parameter is the
number of hashes. I tried a small number of hashes (2) but the error still
happens, and it happens when I call the similarity join. After
transformation, the size of the dataset is about 4G.


Re: Practical configuration to run LSH in Spark 2.1.0

Posted by Nick Pentreath <ni...@gmail.com>.
What other params are you using for the LSH transformer?

Are the issues occurring during transform or during the similarity join?



Re: Practical configuration to run LSH in Spark 2.1.0

Posted by nguyen duc Tuan <ne...@gmail.com>.
Hi Das,
In general, I will apply this to larger datasets, so I want to use LSH,
which is more scalable than the approaches you suggested. Have you
tried LSH in Spark 2.1.0 before? If yes, how did you set the
parameters/configuration to make it work?
Thanks.


Re: Practical configuration to run LSH in Spark 2.1.0

Posted by Debasish Das <de...@gmail.com>.
If it is 7M rows and 700K features (or say 1M features), brute-force row
similarity will run fine as well. Check out SPARK-4823; you can compare its
quality with the approximate variant.
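
The SPARK-4823 row-similarity work is not in a released Spark, but a hedged
sketch of the closest existing API, RowMatrix.columnSimilarities (DIMSUM with
a similarity threshold), gives the flavor of the brute-force alternative. To
get row similarities the item matrix would first have to be transposed so
items become columns; the toy data and threshold below are illustrative only:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.sql.SparkSession

    object DimsumSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dimsum-sketch").getOrCreate()

        // Toy matrix; rows are observations, columns are the items to compare.
        val rows = spark.sparkContext.parallelize(Seq(
          Vectors.dense(1.0, 0.9, 0.0),
          Vectors.dense(0.0, 0.1, 1.0),
          Vectors.dense(1.0, 1.0, 0.0)
        ))

        // Approximate all-pairs column (cosine) similarities above the threshold.
        val sims = new RowMatrix(rows).columnSimilarities(0.6)
        sims.entries.collect().foreach(println)
      }
    }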