Posted to user@spark.apache.org by Selvam Raman <se...@gmail.com> on 2016/10/20 18:42:05 UTC

Spark SQL parallelize

Hi,

I have 40+ structured datasets stored in an S3 bucket as Parquet files.

I am going to use 20 of those tables in this use case.

There is a main table which drives the whole flow. The main table contains
1k records.

My use case is: for every record in the main table, process the rest of the
tables (a join and group by that depend on fields of the main table record).
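To make this concrete, here is a minimal Scala sketch of the per-record
work; the paths, table names, and columns are hypothetical, just for
illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("sketch").getOrCreate()

    // Two of the detail tables (names and paths made up).
    val orders = spark.read.parquet("s3a://my-bucket/orders")
    val items  = spark.read.parquet("s3a://my-bucket/items")

    // The work for a single main-table key: filter the detail table
    // on that key, join, then aggregate.
    def processOne(mainId: Long) =
      orders.filter(col("main_id") === mainId)
        .join(items, "item_id")
        .groupBy("category")
        .agg(sum("amount").as("total"))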

How can I parallelize this process?

What I did was read the main table and call toLocalIterator on the
DataFrame, then do the rest of the processing inside the loop.
This runs one record at a time.
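In code, the current serial approach looks roughly like this (reusing the
hypothetical processOne and spark from the sketch above; the main-table
path and key column are also made up):

    import scala.collection.JavaConverters._

    val main = spark.read.parquet("s3a://my-bucket/main")

    // toLocalIterator pulls the 1k main-table rows to the driver, and
    // each processOne call launches its own Spark job, so the records
    // are handled strictly one after another.
    main.select("id").toLocalIterator().asScala.foreach { row =>
      processOne(row.getLong(0))
        .write.mode("append").parquet("s3a://my-bucket/output")
    }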

Please share your ideas.

Thank you.