Posted to user@spark.apache.org by Christopher Piggott <cp...@gmail.com> on 2017/12/16 02:33:01 UTC

NASA CDF files in Spark

I'm looking to run a job that involves a zillion files in a format called
CDF, a NASA standard.  There are a number of libraries out there that can
read CDFs, but most of them are not high quality compared to the official
NASA one, which has Java bindings (via JNI).  It's a little clumsy, but I
have it working fairly well in Scala.

The way I was planning on distributing work was with
SparkContext.binaryFiles("hdfs://somepath/*"), but that really hands each
worker the file contents as an in-memory stream (an RDD of
(path, PortableDataStream) pairs), and unfortunately the CDF library doesn't
accept any kind of array or stream as input.  The reason is that CDF
requires a random-access file, for performance reasons.

What's worse, all this code is implemented down at the native layer, in C.

I think my best choice here is to distribute the job using .binaryFiles(),
but then have each worker's first task be to write all those bytes to a
ramdisk file (or maybe a real file, we'll see), then have the CDF library
open it as if it were a local file.  This seems clumsy and awful, but I
haven't come up with any better ideas.
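
In sketch form, something like the following.  (The gsfc.nssdc.cdf.CDF.open
entry point and extractSomething are stand-ins here; check them against the
actual Java API and substitute your own per-file logic.)

    import java.nio.file.{Files, StandardCopyOption}

    val results = sc.binaryFiles("hdfs://somepath/*").map {
      case (path, stream) =>
        // Stage the bytes to a node-local temp file so the JNI layer
        // can do random access against a real file.
        val local = Files.createTempFile("cdf-", ".cdf")
        val in = stream.open()
        try Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING)
        finally in.close()
        try {
          val cdf = gsfc.nssdc.cdf.CDF.open(local.toString) // assumed binding
          try extractSomething(cdf) // hypothetical per-file processing
          finally cdf.close()
        } finally Files.deleteIfExists(local)
    }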

Has anybody else worked with these files and come up with a better idea?
Some info on the library that parses all this:

https://cdf.gsfc.nasa.gov/html/cdf_docs.html


--Chris

Re: NASA CDF files in Spark

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
There is also this project:

https://github.com/SciSpark/SciSpark

It might be of interest to you, Christopher.


Re: NASA CDF files in Spark

Posted by Jörn Franke <jo...@gmail.com>.
Develop your own Hadoop InputFormat and use https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/SparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class) to load it. The Spark data source API will also be relevant for you as an alternative in its upcoming version 2.
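
A skeleton of such an InputFormat could look like the following
(illustrative only, not a tested implementation; it reads each file whole
so the native CDF library can later work on a complete local copy):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{BytesWritable, IOUtils, Text}
    import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

    // Whole-file InputFormat: key = file path, value = raw bytes.
    class WholeCdfInputFormat extends FileInputFormat[Text, BytesWritable] {

      // The CDF reader needs the complete file, so never split it.
      override def isSplitable(context: JobContext, file: Path): Boolean = false

      override def createRecordReader(split: InputSplit,
          context: TaskAttemptContext): RecordReader[Text, BytesWritable] =
        new RecordReader[Text, BytesWritable] {
          private var fileSplit: FileSplit = _
          private var conf: Configuration = _
          private var processed = false
          private var key: Text = _
          private var value: BytesWritable = _

          override def initialize(s: InputSplit, ctx: TaskAttemptContext): Unit = {
            fileSplit = s.asInstanceOf[FileSplit]
            conf = ctx.getConfiguration
          }

          // Emit exactly one record per file: (path, file contents).
          override def nextKeyValue(): Boolean = {
            if (processed) return false
            val path = fileSplit.getPath
            val in = path.getFileSystem(conf).open(path)
            val bytes = new Array[Byte](fileSplit.getLength.toInt)
            try IOUtils.readFully(in, bytes, 0, bytes.length)
            finally in.close()
            key = new Text(path.toString)
            value = new BytesWritable(bytes)
            processed = true
            true
          }

          override def getCurrentKey: Text = key
          override def getCurrentValue: BytesWritable = value
          override def getProgress: Float = if (processed) 1.0f else 0.0f
          override def close(): Unit = ()
        }
    }

Loading it is then one call:

    val cdfBytes = sc.newAPIHadoopFile(
      "hdfs://somepath",
      classOf[WholeCdfInputFormat],
      classOf[Text],
      classOf[BytesWritable])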
