Posted to user@spark.apache.org by Mohit Singh <mo...@gmail.com> on 2014/07/29 06:05:26 UTC

Reading hdf5 formats with pyspark

Hi,
   We have set up Spark on an HPC system and are trying to put a data
pipeline and some algorithms in place.
The input data is in HDF5 (these are very high resolution brain
images), and it can be read via the h5py library in Python. So my
current approach (which seems to be working) is to write a function
def process(filename):
    # logic

and then execute it via
files = [list of filenames]
sc.parallelize(files).foreach(process)
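
For concreteness, process is roughly along these lines (the dataset
name "data" is just a placeholder for whatever our files actually
contain):

def process(filename):
    import h5py  # imported inside the function so the workers can resolve it
    # Open the HDF5 file read-only and load the dataset into a NumPy array.
    with h5py.File(filename, 'r') as f:
        arr = f['data'][:]
        # ... per-file logic on arr ...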

Is this the right approach?
-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates

Re: Reading hdf5 formats with pyspark

Posted by Xiangrui Meng <me...@gmail.com>.
That looks good to me, since there is no Hadoop InputFormat for HDF5.
But remember to specify the number of partitions in sc.parallelize so
that all the nodes are used. You can also change `process` to a `read`
function that yields records one by one. Then sc.parallelize(files,
numPartitions).flatMap(read) returns an RDD of records, which you can
use as the start of your pipeline.
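
For example, something along these lines (the dataset name "data" is a
placeholder, and the files are assumed to be visible on every worker,
e.g. on a shared parallel filesystem, with h5py installed on each node):

def read(filename):
    import h5py  # imported inside the function so the workers can resolve it
    # Yield records one by one so flatMap can flatten them into a single RDD.
    with h5py.File(filename, 'r') as f:
        for row in f['data']:
            yield row

numPartitions = len(files)  # e.g. one file per partition keeps all nodes busy
records = sc.parallelize(files, numPartitions).flatMap(read)
# records is an RDD of rows, ready to be the start of your pipeline.

-Xiangrui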
