Posted to user@spark.apache.org by Sa...@wellsfargo.com on 2016/06/03 15:31:02 UTC

Strategies for properly load-balanced partitioning

Hello everyone!

I have noticed that when reading parquet files, or really any kind of source into a data frame (spark-csv, etc.), the default partitioning is not balanced.
Tasks in an action usually run very fast on some partitions and very slowly on others, and frequently run fast on all but the last partition (which appears to read 50%+ of the input data size).

I notice that most tasks load some modest portion of the data, say 1024 MB chunks, while some tasks load 20+ GB.

Applying repartitioning strategies solves this issue properly and improves overall performance considerably, but for very large dataframes, repartitioning is a costly process.

In short, what strategies or configurations are available to read from disk or HDFS with a proper distribution of data across executors?

If this needs to be more specific, I am strictly focused on PARQUET files from HDFS. I know there are some MIN

Really appreciate,
Saif
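For context on where those uneven chunks come from: in Spark 1.x the read partitions for HDFS-backed sources follow Hadoop's FileInputFormat split computation, splitSize = max(minSize, min(maxSize, blockSize)), tunable via mapreduce.input.fileinputformat.split.minsize/maxsize. A simplified pure-Python sketch of that logic (the real code also allows the last split to overflow by a small slop factor):

```python
def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    # Hadoop FileInputFormat: splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def plan_splits(file_size, block_size, min_size=1, max_size=float("inf")):
    """Byte length of each input split for one splittable file (toy model)."""
    split = compute_split_size(block_size, min_size, max_size)
    splits, remaining = [], file_size
    while remaining > 0:
        splits.append(min(split, remaining))
        remaining -= splits[-1]
    return splits

mb = 1024 * 1024
# A 1 GB file on 128 MB blocks -> eight 128 MB read tasks.
print(len(plan_splits(1024 * mb, 128 * mb)))       # 8
# Raising split.minsize to 512 MB packs it into 2 larger tasks instead.
print(len(plan_splits(1024 * mb, 128 * mb, min_size=512 * mb)))  # 2
```

This is only the per-file split arithmetic; it does not model how splits are then grouped onto executors.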


Re: Strategies for properly load-balanced partitioning

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

You can control this kind of issue in the coming v2.0.
See https://www.mail-archive.com/user@spark.apache.org/msg51603.html

// maropu
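For anyone who doesn't want to follow the link: the 2.0 file-scan planning described there bin-packs file chunks into partitions up to a target size, driven by spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes. A simplified pure-Python sketch of that target-size calculation (not the actual implementation; default values assumed from the 2.0 configuration):

```python
def max_split_bytes(total_bytes, num_files,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost=4 * 1024 * 1024,              # spark.sql.files.openCostInBytes
                    default_parallelism=8):
    """Target partition size for the 2.0-style file bin-packing (toy model)."""
    # Each file is padded with an "open cost" so many tiny files still
    # spread across cores instead of collapsing into a handful of tasks.
    padded = total_bytes + num_files * open_cost
    bytes_per_core = padded // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

mb = 1024 * 1024
# Plenty of data: capped at maxPartitionBytes, i.e. roughly even 128 MB tasks.
print(max_split_bytes(10 * 2**30, 100) // mb)  # 128
# Small input: the cap drops so all cores still get work.
print(max_split_bytes(10 * mb, 10) // mb)
```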


On Sat, Jun 4, 2016 at 10:23 AM, Silvio Fiorito <
silvio.fiorito@granturing.com> wrote:

> [quoted message snipped; Silvio's reply appears in full below]



-- 
---
Takeshi Yamamuro

Re: Strategies for properly load-balanced partitioning

Posted by Silvio Fiorito <si...@granturing.com>.
Hi Saif!

When you say this happens with spark-csv, are the files gzipped by any chance? GZip is non-splittable so if you’re seeing skew simply from loading data it could be you have some extremely large gzip files. So for a single stage job you will have those tasks lagging compared to the smaller gzips. As you already said, the option there would be to repartition at the expense of shuffling. If you’re seeing this with parquet files, what do the individual part-* files look like (size, compression type, etc.)?

Thanks,
Silvio
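To see why non-splittable input produces the lag Silvio describes, compare per-task byte counts for splittable files versus gzipped ones in a toy model: a splittable file yields roughly one task per HDFS block, while a .gz file yields exactly one task no matter how large it is.

```python
import math

def tasks_for(file_sizes, block_size, splittable):
    """Bytes each read task must process, under a toy split model."""
    if splittable:
        tasks = []
        for size in file_sizes:
            n = math.ceil(size / block_size)
            tasks += [block_size] * (n - 1) + [size - block_size * (n - 1)]
        return tasks
    # Non-splittable (e.g. gzip): one task per file, however large.
    return list(file_sizes)

mb = 1024 * 1024
sizes = [100 * mb, 120 * mb, 20_000 * mb]   # one huge file among small ones
even = tasks_for(sizes, 128 * mb, splittable=True)
skewed = tasks_for(sizes, 128 * mb, splittable=False)
print(max(even) // mb, max(skewed) // mb)   # 128 vs 20000
```

Same total bytes either way; only the splittable case spreads the big file across many tasks instead of leaving one straggler.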

From: "Saif.A.Ellafi@wellsfargo.com" <Sa...@wellsfargo.com>
Date: Friday, June 3, 2016 at 8:31 AM
To: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Strategies for propery load-balanced partitioning

[original message snipped]


RE: Strategies for properly load-balanced partitioning

Posted by Sa...@wellsfargo.com.
Appreciate the follow up.

I am not entirely sure how or why my question is related to bucketization capabilities. It does sound like a powerful feature for avoiding shuffles, but in my case I am referring to the straightforward process of reading data and writing it to parquet.
If bucketed tables allow setting up buckets before read time and specifying parallelism when writing directly, then you have hit the nail on the head.

My problem is that reading from source (usually hundreds of text files) turns into dataframes with 10k+ partitions, depending on the block size and the number of data splits. Writing these back is a huge overhead for parquet and requires repartitioning to reduce heap memory usage, especially on wide tables.

Let's see how it goes.
Saif
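On the bucketization point being discussed: the idea is that each row is assigned to a bucket by hashing the bucket column at write time, so the data lands on disk pre-partitioned and later reads can skip the shuffle. A toy illustration of that assignment (plain Python hash as a stand-in for Spark's Murmur3-based hash; in 2.0 the actual API is along the lines of df.write.bucketBy(n, col).saveAsTable(...)):

```python
from collections import defaultdict

def bucket_of(key, num_buckets):
    # Spark uses a Murmur3 hash; any stable hash shows the same idea.
    return hash(key) % num_buckets

def write_bucketed(rows, key_fn, num_buckets):
    """Group rows into num_buckets 'files' at write time - no later shuffle."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(key_fn(row), num_buckets)].append(row)
    return dict(buckets)

rows = [{"id": i, "v": i * i} for i in range(1000)]
files = write_bucketed(rows, key_fn=lambda r: r["id"], num_buckets=8)
print(len(files), sum(len(b) for b in files.values()))  # 8 1000
```

Two datasets bucketed the same way on the join key end up co-partitioned on disk, which is what lets the engine avoid the exchange.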


From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.marcu@inria.fr]
Sent: Friday, June 03, 2016 2:55 PM
To: Ellafi, Saif A.
Cc: user; Reynold Xin; michael@databricks.com
Subject: Re: Strategies for properly load-balanced partitioning

[quoted message snipped; Ovidiu's reply appears in full below]


Re: Strategies for properly load-balanced partitioning

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
I suppose you are running on 1.6.
I guess you need a solution based on the [1], [2] features which are coming in 2.0.

[1] https://issues.apache.org/jira/browse/SPARK-12538 / https://issues.apache.org/jira/browse/SPARK-12394
[2] https://issues.apache.org/jira/browse/SPARK-12849

However, I did not check for examples; I would like to add to your question and ask the community to link to some examples of the recent improvements/changes.

It could also help to give a concrete example of your specific problem, as you may be hitting stragglers, possibly caused by data skew.

Best,
Ovidiu
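On the straggler point: before repartitioning blindly, it can help to measure how uneven the partitions actually are, e.g. by collecting per-partition record counts (in Spark, roughly df.rdd.mapPartitions(it => Iterator(it.size)).collect()). The skew check on those counts is then trivial; a toy version:

```python
def skew_ratio(partition_sizes):
    """Max/mean partition size; ~1.0 means balanced, >>1 means stragglers."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

balanced = [1000] * 10
skewed = [1000] * 9 + [50_000]
print(skew_ratio(balanced), round(skew_ratio(skewed), 1))  # 1.0 8.5
```

If the ratio is near 1.0, the slowness is elsewhere (GC, locality, a slow node); only a ratio well above 1 points at data skew worth a repartition.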


> [original message snipped]