Posted to user@spark.apache.org by Sa...@wellsfargo.com on 2015/10/16 23:08:34 UTC

How to speed up reading from file?

Hello,

Is there an optimal number of partitions per number of rows when writing
to disk, so we can re-read later from the source in a distributed way?
Any thoughts?

Thanks
Saif


Re: How to speed up reading from file?

Posted by Xiao Li <ga...@gmail.com>.
Hi, Saif,

The optimal number of rows per partition depends on many factors, right?
For example: your row size, your file system configuration, your
replication configuration, and the performance of your underlying hardware.
The best way is to do performance testing and tune your configuration.
Generally, if each partition contains just a few MB, performance is poor
compared with larger partitions.
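
A minimal sketch of that idea, assuming the Spark DataFrame API
(SparkSession; on 2015-era Spark you would use SQLContext instead). The
paths, the size estimate, and the 128 MB target below are all
hypothetical placeholders, not recommendations from this thread:

    import org.apache.spark.sql.SparkSession

    object WriteWithSizedPartitions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("write-with-sized-partitions")
          .getOrCreate()

        // Hypothetical input path and size estimate -- replace with your own.
        val df = spark.read.parquet("/data/input")
        val estimatedBytes       = 64L * 1024 * 1024 * 1024 // assume ~64 GB
        val targetPartitionBytes = 128L * 1024 * 1024       // assume ~128 MB

        val numPartitions =
          math.max(1, (estimatedBytes / targetPartitionBytes).toInt)

        // repartition() shuffles into evenly sized partitions, so each
        // output file lands near the target size and can be re-read in
        // parallel later.
        df.repartition(numPartitions)
          .write
          .parquet("/data/output")

        spark.stop()
      }
    }

The 128 MB target is only a starting point; the right size depends on
your block size and hardware, which is exactly why measuring matters.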

Check the following paper on the performance of Spark and MapReduce,
http://www.vldb.org/pvldb/vol8/p2110-shi.pdf. It might help you understand
your use case. For example, caching may be applicable in your system.
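
A minimal sketch of caching a dataset that is read more than once,
written spark-shell style; the path and the storage level here are
illustrative assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-example").getOrCreate()

    // Hypothetical input path -- replace with your own.
    val df = spark.read.parquet("/data/input")

    // Keep the data in memory (spilling to disk if it does not fit) so
    // repeated actions avoid re-reading and re-parsing the source files.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    val total  = df.count()             // first action materializes the cache
    val sample = df.limit(10).collect() // later actions reuse the cached data

    df.unpersist()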

Good luck,

Xiao Li

2015-10-16 14:08 GMT-07:00 <Sa...@wellsfargo.com>:

> Hello,
>
> Is there an optimal number of partitions per number of rows when writing
> to disk, so we can re-read later from the source in a distributed way?
> Any thoughts?
>
> Thanks
> Saif
>
>