Posted to dev@spark.apache.org by Nezih <ny...@netflix.com> on 2015/11/23 22:24:01 UTC

question about combining small input splits

Hi Spark Devs,
I tried getting an answer to my question on the user mailing list, but so
far haven't gotten one. That's why I wanted to try the dev mailing list too,
in case someone here can help me.

I have a Hive table that consists of a lot of small parquet files, and I am
creating a DataFrame from it to do some processing. Since I have a large
number of splits/files, my job creates a lot of tasks, which I don't want.
Basically, what I want is the same functionality that Hive provides: combining
these small input splits into larger ones by specifying a max split size
setting. Is this currently possible with Spark?
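
For reference, here is a rough sketch of the kind of thing I am after. The
table name and sizes below are made up, and I don't know whether Spark's
parquet reading actually honors the Hadoop split-size settings; I'm only
showing what I would like to be able to express:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext

    // Standard Hadoop FileInputFormat split-size keys; 256 MB / 128 MB are
    // illustrative values, not recommendations.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)

    // "my_db.my_table" is a hypothetical table name.
    val df = hiveContext.table("my_db.my_table")
    println(df.rdd.partitions.length)  // number of tasks the scan would produce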

I looked at coalesce(), but with coalesce I can only control the number of
output partitions, not their sizes. And since the total input dataset size can
vary significantly in my case, I cannot just use a fixed partition count, as
the size of each output partition can get very large. I then looked for a way
to get the total input size from an RDD, to come up with some heuristic for
setting the partition count, but I couldn't find one.
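
In case it helps clarify, this is the kind of heuristic I had in mind,
sketched here with the Hadoop FileSystem API instead of an RDD-level call.
The path, the 256 MB target, and the flat (non-partitioned) directory layout
are all assumptions on my part:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical table location; real tables may have nested partition
    // directories, which this non-recursive listing would miss.
    val tablePath = new Path("hdfs:///warehouse/my_db.db/my_table")
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // Total bytes across the files directly under the table directory.
    val totalBytes = fs.listStatus(tablePath).filter(_.isFile).map(_.getLen).sum

    // Aim for roughly 256 MB per partition (illustrative target).
    val targetBytesPerPartition = 256L * 1024 * 1024
    val numPartitions = math.max(1, (totalBytes / targetBytesPerPartition).toInt)

    // df is the DataFrame created from the Hive table.
    val combined = df.coalesce(numPartitions)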

Any help is appreciated.

Thanks,

Nezih



