Posted to dev@spark.apache.org by Dhruv Kumar <dh...@umn.edu.INVALID> on 2021/04/10 08:49:58 UTC

Automated setup of a multi-node cluster for Apache Spark and generation of profiling results

Hello,

I am new to Apache Spark and am looking for close guidance or collaboration on my Spark project, which has the following main components:

1. Writing scripts for the automated setup of a multi-node Apache Spark cluster with the Hadoop Distributed File System (HDFS). This is required since I don't have a fixed set of machines to run my Spark experiments and hence need an easy, quick, and automated way to do the entire setup (see the first sketch below).

2. Writing scripts for simple SQL queries which read input from HDFS, run on the multi-node Spark cluster, and store the output back in HDFS (second sketch below).

3. Generating detailed profiling results, such as latency and shuffled data size for every task/operator in a SQL query, and producing graphs for the same (third sketch below).
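
For (1), a rough sketch of what I have in mind, assuming Spark standalone mode, with SPARK_HOME and HADOOP_HOME set identically on every node, passwordless SSH from the master, and worker hostnames kept in a placeholder workers.txt:

    #!/usr/bin/env python3
    # Sketch: bring up HDFS plus a Spark standalone cluster from one script.
    # Assumes SPARK_HOME/HADOOP_HOME are exported on all nodes and SSH keys
    # are in place; workers.txt (one hostname per line) is a placeholder.
    import os
    import subprocess

    SPARK_HOME = os.environ["SPARK_HOME"]
    HADOOP_HOME = os.environ["HADOOP_HOME"]

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Format the NameNode once, then start the HDFS daemons cluster-wide.
    run([f"{HADOOP_HOME}/bin/hdfs", "namenode", "-format", "-nonInteractive"])
    run([f"{HADOOP_HOME}/sbin/start-dfs.sh"])

    # Spark master on this node; a worker on every host in workers.txt.
    # (start-worker.sh is Spark >= 3.1; older releases call it start-slave.sh.)
    run([f"{SPARK_HOME}/sbin/start-master.sh"])
    master_url = "spark://" + os.uname().nodename + ":7077"
    with open("workers.txt") as f:
        for host in (line.strip() for line in f):
            if host:
                run(["ssh", host, f"{SPARK_HOME}/sbin/start-worker.sh", master_url])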
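
For (2), a minimal PySpark example; the HDFS paths, column names, and the query itself are placeholders:

    # Read a CSV from HDFS, run a SQL query on the cluster, write Parquet back.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-sql-demo").getOrCreate()

    # "namenode-host:9000" is a placeholder for the actual NameNode address.
    df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("hdfs://namenode-host:9000/data/input.csv"))
    df.createOrReplaceTempView("events")

    # Any Spark SQL statement works here; this aggregation is just an example.
    result = spark.sql("""
        SELECT user_id, COUNT(*) AS n_events
        FROM events
        GROUP BY user_id
        ORDER BY n_events DESC
    """)

    # Store the output back in HDFS.
    result.write.mode("overwrite").parquet("hdfs://namenode-host:9000/data/output")
    spark.stop()

This would be submitted with something like: spark-submit --master spark://<master-host>:7077 query.py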
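
For (3), I am thinking of pulling per-stage metrics from Spark's monitoring REST API (the same data behind the web UI) and plotting them; "driver-host:4040" is a placeholder for the driver UI address, and the history server (port 18080 by default) serves the same endpoints for finished applications:

    # Sketch: per-stage latency and shuffle sizes from the REST API, plotted
    # with matplotlib. Metric fields come from the /stages endpoint.
    import requests
    import matplotlib.pyplot as plt

    BASE = "http://driver-host:4040/api/v1"

    app_id = requests.get(f"{BASE}/applications").json()[0]["id"]
    stages = requests.get(f"{BASE}/applications/{app_id}/stages").json()

    labels = [f"{s['stageId']}: {s['name'][:30]}" for s in stages]
    run_time_s = [s["executorRunTime"] / 1000.0 for s in stages]  # ms -> s
    shuffle_mb = [(s["shuffleReadBytes"] + s["shuffleWriteBytes"]) / 2**20
                  for s in stages]

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
    ax1.bar(labels, run_time_s)
    ax1.set_ylabel("executor run time (s)")
    ax2.bar(labels, shuffle_mb)
    ax2.set_ylabel("shuffle read+write (MiB)")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.savefig("stage_profile.png")

If I understand correctly, per-task detail is available under /applications/[app-id]/stages/[stage-id], and Spark 3.x also exposes per-operator SQL metrics under /applications/[app-id]/sql, which would cover the operator-level numbers I am after.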

Happy to discuss in more detail.

Thanks
Dhruv
dhruv@umn.edu

--------------------------------------------------
Dhruv Kumar
PhD Candidate
Computer Science and Engineering
University of Minnesota
www.dhruvkumar.me