Posted to user@spark.apache.org by Stephen Boesch <ja...@gmail.com> on 2015/03/23 18:22:20 UTC

SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

Is there a way to take advantage of the underlying datasource partitions
when generating a DataFrame/SchemaRDD via Catalyst? It seems from the sql
module that the only options are RangePartitioner and HashPartitioner, and
that those are selected automatically by the code. It was not apparent
either that the underlying partitioning is translated to the partitions
presented in the RDD, or that a custom partitioner can be provided.

The motivation would be to subsequently use df.map (with
preservesPartitioning=true) and/or df.mapPartitions (likewise) to perform
operations that work within the original datasource partitions, thus
avoiding a shuffle.
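The contract being asked about can be sketched in plain Python (a toy model, not the Spark API): records are hash-partitioned by key, and a map that leaves keys untouched cannot move any record out of its partition, which is exactly why a partition-preserving map needs no shuffle. The names `hash_partition` and `map_values` are illustrative, not Spark identifiers.

```python
# Toy model of the partition-preservation contract (plain Python,
# not Spark): records are hash-partitioned by key, and a map that
# leaves keys unchanged keeps every record in its partition.

NUM_PARTITIONS = 4

def hash_partition(records, n=NUM_PARTITIONS):
    """Assign each (key, value) record to a partition by hash of its key."""
    parts = [[] for _ in range(n)]
    for key, value in records:
        parts[hash(key) % n].append((key, value))
    return parts

def map_values(parts, f):
    """Analogous to a map with preservesPartitioning=true: keys are
    untouched, so no record needs to move between partitions."""
    return [[(k, f(v)) for k, v in part] for part in parts]

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = hash_partition(records)
doubled = map_values(parts, lambda v: v * 2)

# Every key still lives in the partition the hash assigns it to,
# so the original partitioner remains valid after the map.
for i, part in enumerate(doubled):
    assert all(hash(k) % NUM_PARTITIONS == i for k, _ in part)
```

A map that rewrote the keys would break this invariant, which is why Spark only lets you assert preservation rather than inferring it.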

Re: SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

Posted by Michael Armbrust <mi...@databricks.com>.
There is not an interface to this at this time, and in general I'm hesitant
to open up interfaces where users could make a mistake: they think
something is going to improve performance, but it will actually affect
correctness. Since, as you say, we are picking the partitioner
automatically in the query planner, it's much harder to know whether you
are actually going to preserve the expected partitioning.

Additionally, it's also a little more complicated when reading in data:
even if you have files that are partitioned correctly, the InputFormat is
free to split those files, violating our assumptions about partitioning.
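The file-splitting hazard can also be shown with a toy sketch (plain Python, not Hadoop's InputFormat API; `split_file` and the record sizes are made up for illustration): even when each file on disk holds exactly one key's records, an input format that cuts files into fixed-size splits scatters that key across several read splits, invalidating any partitioning you assumed from the file layout.

```python
# Toy illustration of an InputFormat splitting a "correctly partitioned"
# file: one file holds only key "a", but fixed-size splits scatter that
# key's records across multiple read splits (i.e. RDD partitions).

def split_file(records, max_records_per_split):
    """Mimic an input format chopping one file into fixed-size chunks."""
    return [records[i:i + max_records_per_split]
            for i in range(0, len(records), max_records_per_split)]

# One file containing all five records for key "a".
file_a = [("a", v) for v in range(5)]

splits = split_file(file_a, max_records_per_split=2)

# The single key now spans three read splits, so "all records for a key
# are in one partition" no longer holds, even though the file was clean.
keys_per_split = [{k for k, _ in s} for s in splits]
assert len(splits) == 3 and all(ks == {"a"} for ks in keys_per_split)
```

This is why the reply notes that file-level partitioning on disk cannot simply be trusted as the RDD's partitioning.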
