Posted to user@spark.apache.org by J <jo...@gmail.com> on 2014/11/17 04:34:39 UTC

Load json format dataset as RDD

Hi,

I am new to Spark and ran into a problem while trying to load a dataset.

The data is in JSON format, and I'd like to load it as an RDD.

Since a single record may span multiple lines, SparkContext.textFile() won't
work. I also tried parsing the JSON manually with json4s and merging the
results into an RDD one record at a time, but that approach is inconvenient
and inefficient.

There is a JsonRDD in Spark SQL, but it appears to be intended for queries
only.

Could anyone suggest how to load JSON-formatted data as an RDD? For example,
given a file path, load the dataset as an RDD[JObject].

Thank you very much!

Regards,
J

Re: Load json format dataset as RDD

Posted by Cheng Lian <li...@gmail.com>.
|SQLContext.jsonFile| assumes one JSON record per line. Although I 
haven’t tried it yet, this |JsonInputFormat| [1] looks helpful: you can 
read your original data set with |SparkContext.hadoopFile| and 
|JsonInputFormat|, then transform the resulting |RDD[String]| into a 
|JsonRDD| via |SQLContext.jsonRDD|.

[1] 
http://pivotal-field-engineering.github.io/pmr-common/pmr/apidocs/com/gopivotal/mapreduce/lib/input/JsonInputFormat.html
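
The essential job of such an input format — finding record boundaries in a
stream of JSON objects when newlines are no longer reliable delimiters — can
be sketched outside Hadoop. This is plain Python for illustration only, not
the actual |JsonInputFormat| (which does the equivalent inside a Hadoop
RecordReader): track brace depth, ignoring braces inside string literals.

```python
import json

def split_json_records(text):
    """Yield one string per top-level JSON object in `text`.

    A toy version of the boundary detection a multi-line JSON
    input format must perform: a record ends when brace depth
    returns to zero, and braces inside string literals are ignored.
    """
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == '{':
            if depth == 0:
                start = i
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]

# One single-line record and one record spanning several lines,
# with a "}" hidden inside a string to exercise the tricky case.
records = list(split_json_records('{"a": 1}\n{"b": {\n  "c": "}"\n}}'))
parsed = [json.loads(r) for r in records]
```

Each yielded string is a complete record even when it spans lines, which is
exactly the property that makes the resulting |RDD[String]| safe to feed to
|SQLContext.jsonRDD|.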


Re: Load json format dataset as RDD

Posted by Matei Zaharia <ma...@gmail.com>.
Spark SQL gives you an RDD of Row objects that you can query similarly to most JSON object libraries. For example, you can use row(0) to access feature 0, then cast it to something like a String, an Int, a Seq, or another Row if it's a nested object. You can also select the fields you want using SQL syntax and work just with those if you have nested fields (e.g. "select name, location.x, location.y from dataset"). Now you'll get Rows with just those fields.
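
Outside Spark, the same access pattern — navigate the nesting, then use each
value as its concrete type, keeping only the fields you care about — looks
roughly like this. This is plain Python with the json module as an
illustration of the idea, not the SchemaRDD/Row API, and the field names are
made up to match the "select name, location.x, location.y" example:

```python
import json

# One multi-line record with a nested object, analogous to a row
# in the hypothetical "dataset" table from the SQL example above.
record = json.loads("""
{
  "name": "sensor-1",
  "location": {"x": 3.5, "y": -1.25}
}
""")

# Indexing into the parsed object mirrors row(0) plus a cast in
# Spark SQL: descend through the nesting, then treat the value
# as its concrete type (str, float, ...).
name = record["name"]
x = record["location"]["x"]
y = record["location"]["y"]

# Selecting just these fields yields a flat record, like the
# Rows produced by the SQL projection.
flat = {"name": name, "x": x, "y": y}
```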

Matei



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org