Posted to user@predictionio.apache.org by Bolmo Joosten <bo...@gmail.com> on 2017/05/10 15:45:11 UTC

Problem scaling UR

Hi all,

I have trouble scaling the Universal Recommender to a dataset with 250M
events (purchase, view, atb). It trains ok on a couple of million events,
but the training time becomes very long (>48h) on the large dataset.

Hardware specs:

   - Standalone cluster
   - 20 cores (40 with hyper-threading)
   - 264GB RAM

Input data size and format:

   - We load directly from CSV files and bypass HBase. The CSV data is 19 GB.
   - The equivalent size in PIO JSON format is 150 GB.

Train command:

pio train -- --driver-memory 64G --executor-memory 8G --executor-cores 2

I have tried various combinations of driver memory, executor memory, and
number of cores, but the training time does not seem to be affected by this.

Spark UI tells me the save method (collect > $plus$plus) in URModel.scala
takes a very long time. See attached dumps of the Spark UI for details.

Any suggestions?

Thanks, Bolmo

Re: Problem scaling UR

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ah, then you have likely over-constrained the training, since it needs to communicate with all the services and they will compete for resources. Vertical scaling on one machine is by no means ideal; horizontal scaling is what we do for custom installs.

The other thing is that you are only using 1 CPU. You can change that by creating a remote Spark cluster, even if it runs on 1 machine. But first try passing “-- --master local[10] --conf spark.default.parallelism=40”, where the spark.default.parallelism value is something like 4x the cores allocated.
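
For reference, a full command along those lines might look like the following (the 64G heap is just carried over from your original command, not a tuned recommendation; in local mode the executors run inside the driver JVM, so the driver memory setting is the one that counts):

pio train -- --master local[10] --driver-memory 64G --conf spark.default.parallelism=40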




Re: Problem scaling UR

Posted by Bolmo Joosten <bo...@gmail.com>.
Those services are all running on the same (development) machine. I disabled the input event load and any other services, and it is still taking forever (while using only a single CPU).

Could it be a partitioning issue? According to the logs, each of the correlatorRDDs uses only a single partition…
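
In case it is useful, this is the kind of thing I mean (an illustrative sketch only, not our actual code; eventRDD is a stand-in for one of the CSV-backed RDDs in our modified template):

// Illustrative sketch: log the partition count and force a wider layout
// before the expensive correlator stage. `eventRDD` is a stand-in for one
// of the CSV-backed RDDs in our modified template.
println(s"partitions before: ${eventRDD.getNumPartitions}")  // logs show 1 here
val widened = eventRDD.repartition(40)
println(s"partitions after: ${widened.getNumPartitions}")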



Re: Problem scaling UR

Posted by Pat Ferrel <pa...@occamsmachete.com>.
What is the physical architecture? Do you have HBase, Elasticsearch, and Spark running on separate machines? If the CPU load is low then it must be IO-bound, reading from HBase or writing to Elasticsearch. Do you have any input event load yet, or are you making queries? These will all change the equation, and they are why separating the services onto their own machines makes the most sense.



Re: Problem scaling UR

Posted by Bolmo Joosten <bo...@gmail.com>.
Thanks for your suggestion. I forgot to mention in my last email that the
$plus$plus stage takes most of the time (95%+) and uses only 1-3 CPUs.

I will give it a try with lower driver memory and higher executor memory.

Maybe a hard question, but do you have any idea what kind of training time I
should expect with this data size on this cluster?

We modified the default UR template to create the eventRDDs from CSV files
instead of HBase. HBase was unable to handle this amount of data on the
cluster. This means we can't provide any personalized recommendations, but
that is OK for now.
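
To give a rough idea of the shape of that change, here is a simplified,
illustrative sketch (not our actual code; it assumes CSV rows of the form
"userId,eventName,itemId"):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative only: build one (userId, itemId) RDD per event type straight
// from CSV, bypassing the HBase-backed event store.
def eventRDDs(sc: SparkContext, csvPath: String,
              eventNames: Seq[String]): Map[String, RDD[(String, String)]] = {
  val rows = sc.textFile(csvPath).map(_.split(",", 3))
  eventNames.map { name =>
    name -> rows.filter(r => r(1) == name).map(r => (r(0), r(2)))
  }.toMap
}

// e.g. eventRDDs(sc, "hdfs:///data/events.csv", Seq("purchase", "view", "atb"))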


Re: Problem scaling UR

Posted by Pat Ferrel <pa...@occamsmachete.com>.
You can’t bypass HBase; you can import JSON into HBase directly, so I assume this is what you are saying.

Executor memory should be higher and driver memory lower. Spark loves memory, and in this case the lower limit is the size of all your input events plus the BiMaps for all user and item ids. If you don’t get an OOM you are above the minimum, but increasing executor memory might help, as might more executor CPUs. The lower limit for driver memory is roughly equal to the amount per executor.
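
As a concrete example of that direction (the numbers are only illustrative, not tuned for your data):

pio train -- --driver-memory 32G --executor-memory 32G --executor-cores 4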

One unfortunate thing about Spark is that you can scale it to do the job in minutes, but when you go to read from or write to HBase or Elasticsearch, that large a cluster will overload the DBs. So a long training time is not all that bad a thing, since the cluster is probably not overloading the IO.

