Posted to user@spark.apache.org by Nikhil Mishra <ni...@gmail.com> on 2016/10/21 06:12:15 UTC

ALS.trainImplicit block sizes

Hi,

I have a question about the block size to be specified in
ALS.trainImplicit() in PySpark (Spark 1.6.1). There is only one block size
parameter, and I want to know whether it partitions along both the user
and the item axes.

For example, I am using the following call to ALS.trainImplicit() in my
code.

---------------

from pyspark.mllib.recommendation import ALS

RANK = 50
ITERATIONS = 2
BLOCKS = 1000
ALPHA = 1.0

model = ALS.trainImplicit(ratings, RANK, ITERATIONS, blocks=BLOCKS,
                          alpha=ALPHA)

----------------

Will this partition the users x items matrix into BLOCKS x BLOCKS blocks,
or will it partition only the user axis, resulting in BLOCKS blocks, each
spanning all unique items?

Thanks,
Nik

Re: ALS.trainImplicit block sizes

Posted by Nick Pentreath <ni...@gmail.com>.
Oh, also: you mention 20 partitions. Is that how many you have? How many
ratings?

It may be worth trying to repartition to a larger number of partitions.
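
A minimal sketch of that suggestion, reusing the ratings RDD and constants
from the original post. The target partition count here is illustrative,
not a recommendation from this thread:

---------------

from pyspark.mllib.recommendation import ALS

NUM_PARTITIONS = 500  # illustrative value; tune for your data size

# Repartition the ratings RDD before training. With blocks left at its
# default (-1), ALS derives the block count from the RDD's partitioning.
ratings = ratings.repartition(NUM_PARTITIONS)
model = ALS.trainImplicit(ratings, RANK, ITERATIONS, alpha=ALPHA)

----------------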


Re: ALS.trainImplicit block sizes

Posted by Nick Pentreath <ni...@gmail.com>.
I wonder if you can try setting different blocks for user and item? Are
you able to try 2.0, or to use Scala to set them in 1.6?

You want your item blocks to be a lot fewer than your user blocks: items
maybe 5-10, users perhaps 250-500?

Do you have many "power items" that are connected to almost every user? Or
vice versa?
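
For reference, a sketch of what this looks like with the Spark 2.0
DataFrame-based API (pyspark.ml.recommendation.ALS), which exposes
numUserBlocks and numItemBlocks separately. The DataFrame and its column
names are assumed for illustration; the block counts follow the ranges
suggested above:

---------------

from pyspark.ml.recommendation import ALS

# ratings_df is assumed to be a DataFrame of (userId, itemId, rating).
als = ALS(
    rank=50,
    maxIter=2,
    implicitPrefs=True,
    alpha=1.0,
    numUserBlocks=250,  # many user blocks
    numItemBlocks=10,   # far fewer item blocks
    userCol="userId",
    itemCol="itemId",
    ratingCol="rating",
)
model = als.fit(ratings_df)

----------------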

On Fri, 21 Oct 2016 at 16:46, Nikhil Mishra <ni...@gmail.com>
wrote:

> Yes, that's what I tried initially. The default value is pretty low,
> something like 20; it depends on the number of partitions in the ratings
> RDD. It was going out of memory with the default size too.
>
> On Fri, Oct 21, 2016 at 5:31 AM, Nick Pentreath <ni...@gmail.com>
> wrote:
>
> Did you try not setting the blocks parameter? It will then try to set it
> automatically for your data size.
> On Fri, 21 Oct 2016 at 09:16, Nikhil Mishra <ni...@gmail.com>
> wrote:
>
> I am using 105 nodes (1 master, 4 core, and 100 task nodes). All are
> 7.5 GB machines.
>

Re: ALS.trainImplicit block sizes

Posted by Nick Pentreath <ni...@gmail.com>.
How many nodes are you using in the cluster?



On Fri, 21 Oct 2016 at 08:58 Nikhil Mishra <ni...@gmail.com>
wrote:

> Thanks Nick.
>
> So we do partition the U x I matrix into B x B blocks, each of size
> around U/B by I/B. Is that correct? And do you know whether a single
> block of the matrix is represented in memory as a full matrix or as a
> sparse matrix? I ask because my job has been failing for block sizes
> that should have worked.
>
> I have U = 85 million users and I = 250,000 items, and when I specify a
> block size of 5,000 I get an out-of-memory error, even though I am
> setting --executor-memory to 7g (on a Linux EC2 instance which has 7.5
> GB of memory). Assuming each block has 17,000 users and 50 items, even
> if the block is internally represented as a full matrix, it should
> still occupy around 50 MB of space.
>
> Increasing the block size to 20,000 gives the same result, so there is
> something I don't understand about how this works.
>
> BTW, I am trying to find 50 latent factors (rank = 50).
>
> Do you have some insights as to how I should tweak things to get this
> working?
>
> Thanks,
> Nik
>
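
(A rough back-of-envelope check of these sizes, assuming 8-byte double
values and dense storage. Real ALS memory use also includes the in/out
link structures and shuffle buffers, so treat these numbers as a lower
bound.)

---------------

U = 85000000   # users
I = 250000     # items
B = 5000       # blocks per axis
RANK = 50      # latent factors

# A dense U/B x I/B ratings sub-block at 8 bytes per entry:
users_per_block = U // B  # 17,000
items_per_block = I // B  # 50
print(users_per_block * items_per_block * 8)  # 6,800,000 bytes, ~6.8 MB

# The factor matrices dominate: user factors alone, cluster-wide:
print(U * RANK * 8)  # 34,000,000,000 bytes, ~34 GB

# Item factors are small (I * RANK * 8, ~100 MB), but slices of them are
# shuffled to every user block that needs them, so a very large block
# count can inflate shuffle volume instead of reducing peak memory.

----------------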

Re: ALS.trainImplicit block sizes

Posted by Nick Pentreath <ni...@gmail.com>.
The blocks param will set both user and item blocks.

Spark 2.0 supports user and item blocks for PySpark:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation
