Posted to user@spark.apache.org by Corey Nolet <cj...@gmail.com> on 2014/11/12 23:05:10 UTC

Spark SQL Lazy Schema Evaluation

I'm loading sequence files containing json blobs in the value, transforming
them into RDD[String] and then using hiveContext.jsonRDD(). It looks like
Spark reads the files twice: once when I define the jsonRDD() and again when
I actually make my call to hiveContext.sql().
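
Roughly, the pipeline looks like this (an untested sketch; the path, the Text
key/value types, and the table name are placeholders):

import org.apache.hadoop.io.Text
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// The json blobs live in the sequence file values.
val jsonStrings: RDD[String] = sc
  .sequenceFile("/data/events", classOf[Text], classOf[Text])
  .map { case (_, value) => value.toString }

// Pass #1 over the data: jsonRDD() infers the schema here.
val events = hiveContext.jsonRDD(jsonStrings)
events.registerTempTable("events")

// Pass #2 over the data: the actual query.
hiveContext.sql("SELECT * FROM events").collect()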

Looking at the code, I see an inferSchema() method which gets called under
the hood. I also see an experimental jsonRDD() method which takes a
sampleRatio.
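
e.g. it looks like I could do something like this, though I'm not sure it's
meant to address the double read:

// infer the schema from roughly 1% of the records instead of all of them
val events = hiveContext.jsonRDD(jsonStrings, 0.01)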

My dataset is extremely large and I've got a lot of processing to do on it-
looping through it twice is not a luxury I can afford. I also know that the
SQL I am going to be running matches at least "some" of the records contained
in the files. Would it make sense, or be possible with the current execution
plan design, to bypass inferring the schema for the sake of speed?

Though I haven't dug much further into the code than the implementations of
the client API methods I'm calling, I am wondering if there's a way to
process the data without pre-determining the schema. I also don't have the
luxury of supplying the full schema ahead of time, because I may want to do a
"select * from table" while only knowing 2 or 3 of the json keys that are
actually available.

Thanks.

Re: Spark SQL Lazy Schema Evaluation

Posted by Michael Armbrust <mi...@databricks.com>.
There are a few things you can do here (rough sketches follow the list):

 - Infer the schema on a subset of the data, then pass that inferred schema
(schemaRDD.schema) as the second argument of jsonRDD.
 - Hand-construct a schema containing only the fields you are interested in
and pass it as the second argument.
 - Load the data as a table with a single string column instead and use Hive
UDFs to extract the fields you want.
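
Roughly, those three options might look something like this (an untested
sketch against the 1.1-era API; "jsonStrings" stands in for your RDD[String]
and the field names "user" and "action" are made up):

import org.apache.spark.sql._

// Option 1: infer the schema from a small sample, then reuse it for the
// full pass so the whole dataset is only scanned by the query itself.
val sampled = hiveContext.jsonRDD(jsonStrings.sample(false, 0.01, 42))
val events = hiveContext.jsonRDD(jsonStrings, sampled.schema)
events.registerTempTable("events")

// Option 2: hand-construct a schema with only the fields you care about;
// jsonRDD skips inference entirely when a schema is supplied.
val schema = StructType(Seq(
  StructField("user", StringType, nullable = true),
  StructField("action", StringType, nullable = true)))
hiveContext.jsonRDD(jsonStrings, schema).registerTempTable("events_projected")

// Option 3: register the raw strings as a single-column table and extract
// fields at query time with Hive's JSON UDFs.
val rawSchema = StructType(Seq(StructField("json", StringType, nullable = true)))
val raw = hiveContext.applySchema(jsonStrings.map(Row(_)), rawSchema)
raw.registerTempTable("raw_json")
hiveContext.sql(
  "SELECT get_json_object(json, '$.user') AS user FROM raw_json")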

Michael
