Posted to user@spark.apache.org by Peter Halliday <pj...@cornell.edu> on 2016/06/13 20:04:30 UTC

how to investigate skew and DataFrames and RangePartitioner

I have two questions.

First, I have a failure when I write Parquet from Spark 1.6.1 on Amazon EMR to S3.  This is a full batch of over 200GB of source data.  The partitioning is based on a geographic identifier we use and also the date we received the data.  However, because of geographical density we could well be producing tiles that are too dense.  I’m trying to figure out how to determine the size of the file it’s trying to write out.
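For reference, here is a rough sketch of how I could count rows per output partition key to spot the dense tiles before the write.  The column names geoTile and ingestDate are placeholders for our actual partition columns, and df stands for the DataFrame we write out:

  import org.apache.spark.sql.functions.{col, desc}

  // df is assumed to be the DataFrame written with
  // .write.partitionBy("geoTile", "ingestDate"); both column names are placeholders.
  val rowsPerKey = df
    .groupBy(col("geoTile"), col("ingestDate"))
    .count()
    .orderBy(desc("count"))

  // The heaviest keys are the likeliest sources of over-dense output files.
  rowsPerKey.show(20, false)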

Second, we used to use RDDs and RangePartitioner for task partitioning.  However, I don’t see this available in DataFrames.  How does one achieve this now?

Peter Halliday
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: how to investigate skew and DataFrames and RangePartitioner

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

I'm afraid there is currently no API to define a RangePartitioner on a DataFrame.
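If you really need range partitioning in 1.6, one possible workaround is to drop down to the RDD API and rebuild the DataFrame afterwards.  A minimal sketch (the key column name "geoTile" and the partition count 200 are placeholders, and sqlContext is your usual SQLContext):

  import org.apache.spark.RangePartitioner

  // Key the rows by the column to range-partition on; "geoTile" is a placeholder.
  val keyed = df.rdd.keyBy(row => row.getAs[String]("geoTile"))

  // RangePartitioner samples the keys to pick balanced bounds; 200 is a placeholder.
  val partitioner = new RangePartitioner(200, keyed)

  // Rebuild the DataFrame from the range-partitioned rows, keeping the original schema.
  val rangedDf = sqlContext.createDataFrame(keyed.partitionBy(partitioner).values, df.schema)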

// maropu

On Tue, Jun 14, 2016 at 5:04 AM, Peter Halliday <pj...@cornell.edu> wrote:

> I have two questions.
>
> First, I have a failure when I write Parquet from Spark 1.6.1 on Amazon EMR
> to S3.  This is a full batch of over 200GB of source data.  The
> partitioning is based on a geographic identifier we use and also the date
> we received the data.  However, because of geographical density we could
> well be producing tiles that are too dense.  I’m trying to figure out how
> to determine the size of the file it’s trying to write out.
>
> Second, we used to use RDDs and RangePartitioner for task partitioning.
> However, I don’t see this available in DataFrames.  How does one achieve
> this now?
>
> Peter Halliday


-- 
---
Takeshi Yamamuro