Posted to user@spark.apache.org by Rakesh Nair <ra...@gmail.com> on 2014/12/11 05:50:08 UTC

Compare performance of sqlContext.jsonFile and sqlContext.jsonRDD

Couple of questions:
1. "sqlContext.jsonFile" reads a JSON file, infers the schema of the
stored data, and then returns a SchemaRDD. Now, I could also create a
SchemaRDD by reading the file as text (which returns an RDD[String]) and
then using the "jsonRDD" method. My question: is the "jsonFile" way of
creating a SchemaRDD slower than the second method (perhaps because
jsonFile needs to infer the schema, while jsonRDD just applies a known
schema to the dataset)?

The workflow I am thinking of is:
1. For the first data set, use "jsonFile" and infer the schema.
2. Save the schema somewhere.
3. For later data sets, create an RDD[String] and then use the "jsonRDD"
method to convert the RDD[String] to a SchemaRDD.
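The three steps above might look roughly like this in Scala. This is a
sketch, not tested against your setup: it assumes a Spark 1.2-era
classpath (SchemaRDD, and data types under
org.apache.spark.sql.catalyst.types, which moved to
org.apache.spark.sql.types in Spark 1.3), and the input paths are
placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.types.{DataType, StructType}

val sc = new SparkContext(new SparkConf().setAppName("json-schema").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// 1. First data set: let jsonFile infer the schema (this runs an extra Spark job).
val first = sqlContext.jsonFile("data/day1.json")   // placeholder path
val schemaJson = first.schema.json                  // schema serialized to a JSON string

// 2. Persist schemaJson somewhere durable (HDFS, local file, ...).

// 3. Later data sets: apply the stored schema and skip inference entirely.
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val later = sqlContext.jsonRDD(sc.textFile("data/day2.json"), schema)
```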


2. What is the best way to store a schema? More specifically, how can I
serialize a StructType and store it in HDFS so that I can load it later?

-- 
Regards
Rakesh Nair

Re: Compare performance of sqlContext.jsonFile and sqlContext.jsonRDD

Posted by Cheng Lian <li...@gmail.com>.
There are several overloaded versions of both jsonFile and jsonRDD.
Schema inference is fairly expensive, since it requires an extra Spark
job. You can avoid it by storing the inferred schema and then using it
with the following two methods:

  * def jsonFile(path: String, schema: StructType): SchemaRDD
  * def jsonRDD(json: RDD[String], schema: StructType): SchemaRDD

You can use StructType.json/StructType.prettyJson and
DataType.fromJson to store and load the schema.
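A round-trip sketch of that serialization, writing the schema JSON to
HDFS via Hadoop's FileSystem API. Again assuming Spark 1.2-era imports
(org.apache.spark.sql.catalyst.types; later releases use
org.apache.spark.sql.types), and the HDFS path is a placeholder:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.catalyst.types.{DataType, StringType, StructField, StructType}

// A schema to store -- in practice, take this from an inferred SchemaRDD's .schema.
val schema = StructType(Seq(StructField("name", StringType, nullable = true)))
val json = schema.json          // compact form; use schema.prettyJson for readability

// Write the JSON string to HDFS so later jobs can load it.
val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/schemas/events.schema.json"))  // placeholder path
out.write(json.getBytes("UTF-8"))
out.close()

// Later: read the file back and restore the StructType.
val in = fs.open(new Path("/schemas/events.schema.json"))
val restored = DataType.fromJson(scala.io.Source.fromInputStream(in).mkString)
  .asInstanceOf[StructType]
```

Since DataType.fromJson returns a DataType, the cast back to StructType
is needed before passing it to jsonFile or jsonRDD.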

Cheng

On 12/11/14 12:50 PM, Rakesh Nair wrote:
