Posted to user@spark.apache.org by Aakash Basu <aa...@gmail.com> on 2019/02/11 09:40:32 UTC

Data growth vs Cluster Size planning

Hi,

I ran a dataset of 200 columns and 0.2M records on a cluster of 1 master
(18 GB) and 2 slaves (32 GB each, 16 cores per slave); a very large ML
tuning job (training) took around 772 minutes.

Now, my requirement is to run the same operation on 3M records. Any ideas
on how we should proceed? Should we scale vertically or horizontally? How
should this problem be approached in a stepwise, systematic manner?

Thanks in advance.

Regards,
Aakash.

Re: Data growth vs Cluster Size planning

Posted by Phillip Henry <lo...@gmail.com>.
Too little information to give an answer, if indeed an answer a priori is
possible.

However, I would do the following on your test instances:

- Run jstat -gc on all your nodes. It might be that the GC is taking a lot
of time.

- Poll with jstack semi-frequently. It can give you a fairly good idea of
where in the code the time is being spent, in a non-invasive manner
(example invocations below).
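As a rough sketch of what that polling might look like (the PID and the
sampling interval are placeholders; you can find the executor PIDs with
jps on each node):

    # GC statistics every 5 s, with a timestamp column; large or growing
    # GC time columns (FGCT, GCT) point at memory pressure
    jstat -gc -t <executor_pid> 5s

    # Thread dump; take several over a few minutes and compare which
    # stacks keep showing up - those are where the time is going
    jstack <executor_pid> > stack-dump-$(date +%s).txt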

Phillip


