Posted to user@spark.apache.org by Supun Kamburugamuve <su...@gmail.com> on 2019/07/15 15:45:15 UTC

Sorting tuples with byte key and byte value

Hi all,

We are trying to measure the sorting performance of Spark. We have a
16-node cluster with 48 cores and 256GB of RAM in each machine, and a
10Gbps network.

Let's say we are running with 128 parallel tasks and each partition
generates about 1GB of data (total 128GB).

We are using the method *repartitionAndSortWithinPartitions*.
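
To make the setup concrete, here is a minimal sketch of this kind of
byte-key/byte-value sort (illustrative only, not our exact code; the
record sizes, data generator, and partitioner are just examples):

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

object ByteSortBenchmark extends Serializable {

  // Lexicographic, unsigned ordering for byte-array keys; an implicit
  // Ordering[K] must be in scope for repartitionAndSortWithinPartitions.
  implicit val byteArrayOrdering: Ordering[Array[Byte]] =
    new Ordering[Array[Byte]] {
      def compare(a: Array[Byte], b: Array[Byte]): Int = {
        val n = math.min(a.length, b.length)
        var i = 0
        while (i < n) {
          val c = (a(i) & 0xff) - (b(i) & 0xff)
          if (c != 0) return c
          i += 1
        }
        a.length - b.length
      }
    }

  // Partition by the first key byte so partitions are roughly range-ordered
  // (a plain HashPartitioner does not work well with array keys).
  class BytePrefixPartitioner(parts: Int) extends Partitioner {
    def numPartitions: Int = parts
    def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[Array[Byte]]
      ((k(0) & 0xff) * parts) / 256
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("byte-sort"))
    val parallelism = 128
    val recordsPerPartition = 10 * 1000 * 1000 // ~1GB per partition at 100 bytes/record

    // Generate (key, value) byte-array pairs in memory; nothing is read from disk.
    val data = sc.parallelize(0 until parallelism, parallelism).flatMap { p =>
      val rnd = new java.util.Random(p)
      Iterator.fill(recordsPerPartition) {
        val k = new Array[Byte](10); rnd.nextBytes(k)
        val v = new Array[Byte](90); rnd.nextBytes(v)
        (k, v)
      }
    }

    // Shuffle into 128 partitions and sort each partition by key.
    val sorted =
      data.repartitionAndSortWithinPartitions(new BytePrefixPartitioner(parallelism))
    sorted.foreach(_ => ()) // force evaluation; nothing is written out
    sc.stop()
  }
}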

A standalone cluster is used with the following configuration.

SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=16G
SPARK_WORKER_INSTANCES=8

--executor-memory 16G --executor-cores 1 --num-executors 128

I believe this sets up 128 executors to run the job, each with 16GB of
memory, spread across 16 nodes with 8 threads per node. This configuration
runs very slowly. The program doesn't read or write data on disk (the data
is generated in memory, and we don't write to a file after sorting).

It seems that even though the data size is small, Spark uses the disk for
the shuffle. We are not sure our configuration is optimal for achieving the
best performance.

Best,
Supun..

Re: Sorting tuples with byte key and byte value

Posted by Supun Kamburugamuve <su...@gmail.com>.
Thanks, Keith. We have set SPARK_WORKER_INSTANCES=8, so that means we are
running 8 workers on a single machine, each with 1 thread, and this gives
the 8 threads?

Is there a preference for running 1 worker with 8 threads inside it? These
are dual-CPU machines, so I believe we need at least 2 worker instances per
machine. If that is the case, I can use 2 worker instances, each with 4
threads.
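
For example, something like the following (the memory figures are only
illustrative; the idea is to keep 128 task slots and roughly the same
memory per node as before):

SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=64G
SPARK_WORKER_INSTANCES=2

--executor-memory 64G --executor-cores 4 --num-executors 32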

Another question: how can we avoid using the disk for the shuffle operation?
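
For example, would pointing the shuffle scratch directories at a RAM-backed
filesystem help, along the lines of

--conf spark.local.dir=/dev/shm/spark

or is there a setting that keeps the shuffle data entirely in memory?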

Best,
Supun..

On Mon, Jul 15, 2019 at 8:49 PM Keith Chapman <ke...@gmail.com>
wrote:

> Hi Supun,
>
> A couple of things with regard to your question.
>
> --executor-cores sets the number of worker threads per executor (JVM).
> According to your requirement, this should be set to 8.
>
> *repartitionAndSortWithinPartitions* is an RDD operation, and RDD
> operations in Spark are not performant in terms of either execution time
> or memory. I would rather use the DataFrame sort operation if performance
> is key.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
>
> On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <
> supun.kamburugamuve@gmail.com> wrote:
>
>> Hi all,
>>
>> We are trying to measure the sorting performance of Spark. We have a
>> 16-node cluster with 48 cores and 256GB of RAM in each machine, and a
>> 10Gbps network.
>>
>> Let's say we are running with 128 parallel tasks and each partition
>> generates about 1GB of data (total 128GB).
>>
>> We are using the method *repartitionAndSortWithinPartitions*.
>>
>> A standalone cluster is used with the following configuration.
>>
>> SPARK_WORKER_CORES=1
>> SPARK_WORKER_MEMORY=16G
>> SPARK_WORKER_INSTANCES=8
>>
>> --executor-memory 16G --executor-cores 1 --num-executors 128
>>
>> I believe this sets up 128 executors to run the job, each with 16GB of
>> memory, spread across 16 nodes with 8 threads per node. This
>> configuration runs very slowly. The program doesn't read or write data
>> on disk (the data is generated in memory, and we don't write to a file
>> after sorting).
>>
>> It seems that even though the data size is small, Spark uses the disk
>> for the shuffle. We are not sure our configuration is optimal for
>> achieving the best performance.
>>
>> Best,
>> Supun..
>>
>>

Re: Sorting tuples with byte key and byte value

Posted by Keith Chapman <ke...@gmail.com>.
Hi Supun,

A couple of things with regard to your question.

--executor-cores sets the number of worker threads per executor (JVM).
According to your requirement, this should be set to 8.
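
For your cluster that would be something along these lines (one executor
per node; the memory figure is just an example):

--executor-memory 128G --executor-cores 8 --num-executors 16

together with a single worker per node that exposes at least 8 cores
(e.g. SPARK_WORKER_INSTANCES=1 and SPARK_WORKER_CORES=8).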

*repartitionAndSortWithinPartitions* is an RDD operation, and RDD operations
in Spark are not performant in terms of either execution time or memory. I
would rather use the DataFrame sort operation if performance is key.
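
Roughly, the DataFrame version would look something like this (a sketch,
assuming your pairs are already an RDD[(Array[Byte], Array[Byte])] named
data and spark is your SparkSession; Spark SQL can sort binary columns
directly):

import spark.implicits._

// Byte arrays become Spark SQL BinaryType columns.
val df = data.toDF("key", "value")

// Same shape as repartitionAndSortWithinPartitions: shuffle into 128
// partitions by key, then sort each partition by key.
val sorted = df.repartition(128, $"key").sortWithinPartitions($"key")

sorted.foreach(_ => ()) // force evaluation without writing output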

Regards,
Keith.

http://keith-chapman.com


On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <
supun.kamburugamuve@gmail.com> wrote:

> Hi all,
>
> We are trying to measure the sorting performance of Spark. We have a
> 16-node cluster with 48 cores and 256GB of RAM in each machine, and a
> 10Gbps network.
>
> Let's say we are running with 128 parallel tasks and each partition
> generates about 1GB of data (total 128GB).
>
> We are using the method *repartitionAndSortWithinPartitions*.
>
> A standalone cluster is used with the following configuration.
>
> SPARK_WORKER_CORES=1
> SPARK_WORKER_MEMORY=16G
> SPARK_WORKER_INSTANCES=8
>
> --executor-memory 16G --executor-cores 1 --num-executors 128
>
> I believe this sets up 128 executors to run the job, each with 16GB of
> memory, spread across 16 nodes with 8 threads per node. This
> configuration runs very slowly. The program doesn't read or write data
> on disk (the data is generated in memory, and we don't write to a file
> after sorting).
>
> It seems that even though the data size is small, Spark uses the disk for
> the shuffle. We are not sure our configuration is optimal for achieving
> the best performance.
>
> Best,
> Supun..
>
>